ISO/IEC JTC1/SC22/WG5 N1883

Comments on the contents of the TS on further coarray features

John Reid
30 September 2011

In N1868, I invited comments on the technical content of the TS on further coarray features, so that PL22.3 (J3) can begin to construct a draft requirements document during its meeting in October (10-14 October). I asked the question "Is the technical content of N1858 suitable for the TS on further coarray features?" and requested that it be answered in one of these ways:
1) Yes.
2) Yes, but I recommend the following changes (please do not ask only for additions).
Here are the replies.

1. Reinhold Bader

I have talked with Uwe Kuester (HLRS), who has experience with the Cray implementation, as well as with Tobias Burnus; both share my serious doubts that the technical content of N1858 is suitable (without a serious redesign effort), which is why I would like to register a NO answer.

My answer is: NO, with comments. [I am aware that this is not one of the alternatives I was allowed to choose among, but after much consideration I still think this is the appropriate answer.]

Reasons for the NO vote, and comments:

(1) WG5/N1858 Coarray TS draft (Long) was extracted from a Fortran 2008 draft standard, among other reasons, because its technical content was considered at least partially controversial. In its present form, too many issues and open questions remain, as is indicated by the fact that many additional suggestions were made in WG5/N1835 Requirements for TR of further coarray features (Reid), as well as in WG5/N1856 Addition/Modification of CAF Features (authors from Rice University). These also reflect new knowledge obtained from a number of years of research, which should be taken into account when re-designing the basic ideas for the coarray extensions.

(2) While I agree that the workload on J3 may be an issue, and that the number and complexity of the features dealt with by the TS should be limited, based on the observations in (1) I think it is unrealistic to expect that the minimal reasonable feature set will be only as big and complex as the one defined in N1858. My opinion is that the correct way to deal with this situation is to
(a) set up, in a manner analogous to what WG5/N1820 C Interoperability Objectives (Maclaren/Long) did for the interop TR, a document that describes the objectives, including features, requirements, constraints and excluded features, for the coarray TS. It would be nice if the feature items on this list could be individually voted on at the WG5 level.
(b) estimate the additional amount of work this would require of J3. If this takes longer than originally envisioned, that is still a better situation than rapidly churning out a badly designed coarray extension.
In my opinion the sweet spot is a feature set that is approximately 30% bigger in complexity and J3 workload than the one implied by N1858, but that provides a significantly greater enhancement of programming productivity and performance scalability to coarray Fortran users. The price to pay (probably 2-3 additional J3 meetings compared with the present schedule in WG5/N1859 Strategic plans for WG5 (Reid)) seems adequate; an overlap of at most one year with the start of the work on the next Fortran standard also seems acceptable.
(3) I also do not consider it a good idea to freeze part of the features before all others, at least not unless the process suggested in (2a) allows one to determine that a particular feature decided there does not interact in any relevant way with a feature targeted for "early release" (which is unlikely). This is also an argument against attempting to split off parts of the coarray TS contents to be treated separately in future Fortran extensions.

.........................................................................

2. Uwe Kuester (kuester@hlrs.de)

Coarray Fortran enables the programmer to formulate one-sided communication in a simple and intuitive way via the codimension syntax. Why is this important? In a modern computer we see latencies of various kinds for data access: memory and cache latencies, and much larger latencies in the interconnection network. Latencies hinder good performance because they limit the bandwidth that is actually achievable for small messages. In a given architecture latencies cannot be reduced. They can be avoided by concatenating many small data items into a single stream, with the latency paid only once at the beginning of the stream, or they can be hidden behind other useful operations.

When a consumer requests data from a remote source, the latency typically appears twice: once for sending the request (the remote address) and once for transferring the data back. The advantage is that the consumer can use the data directly after arrival. The statement a = b[remote_proc] allows the immediate use of a after this statement, but unless the compiler can move the fetch of b[remote_proc] to an earlier point in time, we have to wait a long latency time, even assuming that b is already defined in the remote memory. Communicating in the opposite direction, a[target_proc] = b, implies almost no latency for the image on which b resides. The image target_proc pays instead with uncertainty about when the data will arrive. A synchronizing statement sync images([target_proc,remote_proc]) ensures that a can be used on target_proc, but it costs more than twice the latency.

A well formulated and well programmed parallel algorithm should contain as few synchronization points as possible to ensure high performance for a large number of active images. The flow of information should go in one direction only, in order to decouple sender and receiver. This removes some latencies and allows pipelining. If the order of the information transfers is not changed, a special trailer at the end of the transmitted data can inform the target processor of the successful arrival of the data. The consuming processor may wait for the data or may do other work in the meantime. That is the purpose of NOTIFY --> QUERY pairs. The sending processor informs via NOTIFY that it has initiated the transmission and has transferred the data to the transmitting hardware. The image target_proc recognizes the message as the trailer of the data. Without the NOTIFY --> QUERY mechanism the one-sided communication capabilities of Coarray Fortran are not complete: unwanted synchronization via "sync images" or "sync all" is then needed.
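As an illustration of this pattern (a minimal sketch only; it uses the NOTIFY/QUERY syntax of N1858, and the image indices sender and consumer and the value assigned are purely illustrative):

   real    :: a[*], b
   integer :: sender, consumer

   sender = 1; consumer = 2     ! illustrative image numbers
   if (this_image() == sender) then
      b = 42.0                  ! produce the data locally
      a[consumer] = b           ! one-sided put: cheap for the sender
      notify ([consumer])       ! tell the consumer the data is on its way
   else if (this_image() == consumer) then
      query ([sender])          ! wait for the sender's notification
      print *, a                ! a can now be used; no sync all is needed
   end if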
Remark 1: Because the notifying image could proceed to another context and would produce other NOTIFYs in this new context for other purposes, it will be necessary to differentiate between the different contexts.

Remark 2: QUERY([proc]) will wait and block image proc in the case that image target_proc has not yet received the data. This is very different from the behaviour of QUERY([proc], READY=ready), which blocks neither image target_proc nor image proc. I would recommend different names, e.g. BLOCKING_QUERY for the first case.

.........................................................................

3. Laksono Adhianto, Guohua Jin, John Mellor-Crummey, Karthik Murthy, Dung Nguyen, William N. Scherer III, Scott Warren, and Chaoran Yang
Department of Computer Science, Rice University
{laksono, jin, johnmc, Karthik.S.Murthy, dxnguyen, scherer, scott, chaoran}@rice.edu
23 September 2011

This document comprises the Rice University Coarray Fortran group's response to ISO/IEC JTC1/SC22/WG5 N1868, Invitation to comment on the contents of the TS on further coarray features, dated 8 July 2011. In that document, commenters are asked to respond explicitly to the question: Is the technical content of N1858 suitable for the TS on further coarray features?

We believe that, to a large extent, the answer to this question is yes. While we have some reservations about the proposed definition of the features, we feel that the coarray features enable one to build many useful parallel Fortran applications. In the remainder of this document, we provide specific criticism of the coarray features defined in N1858, based on our experience designing, developing, and using the Rice Coarray Fortran 2.0 prototype. To summarize our response to N1858, we espouse the following:

Proposed additions:
. TEAM_WORLD
. TEAM_SIZE / TEAM_RANK
. TEAM_DEFAULT and WITH TEAM
. TEAM_BARRIER
. TEAM_SPLIT

Proposed deletions:
. FORM_TEAM
. TEAM_IMAGES
. SYNC TEAM
. SYNC ALL
. NUM_IMAGES

1. Missing Features

We understand that the goal of N1858 is to provide a small coherent set of coarray features that enable developers to write reasonable Fortran applications with coarray features. We believe that a few small additions to the features presented in N1858 will greatly expand the range of applications that can be conveniently expressed.

1.1. TEAM_WORLD

We advocate pre-declaring a team variable representing the entire set of images in the application. This variable, which we have named TEAM_WORLD in the Rice Coarray Fortran 2.0 prototype, greatly simplifies the documentation of the language.

1.2. TEAM_DEFAULT and WITH TEAM

We advocate pre-declaring a team variable that represents the current default team for any team-based operations. At program start, TEAM_DEFAULT is initialized to TEAM_WORLD. In the Rice CAF 2.0 prototype, we let developers change the default team via a dynamically scoped, block-structured WITH TEAM statement. For instance:

   WITH TEAM (ROW_TEAM)
      ! the value of TEAM_DEFAULT for all team operations within this
      ! block, or anything it calls, is ROW_TEAM
      ...
   END WITH TEAM

1.3. TEAM_SIZE and TEAM_RANK

We have found the ability to query the number of members of a team (TEAM_SIZE) and to query the current image's logical position within a team (TEAM_RANK) to be indispensable for working with processor subsets. TEAM_SIZE(TEAM_WORLD) gives the total number of images. In the current proposal, images index coarray data and synchronize using absolute image numbers; additionally, image teams are described by a set of absolute image numbers. As Skjellum et al. [1] observe, "libraries don't want to describe point-to-point communication using hardware-dependent names, in fact, many algorithms are more natural if described in terms of point-to-point calls relative to a virtual topology naming scheme." They advocate abstract names for processors based on virtual topologies, or at least rank-in-group names (i.e., image numbers relative to a subset of the executing images). For instance, synchronizing with one's successor in a ring formed from a processor subset is a trivial operation given the TEAM_SIZE and TEAM_RANK operations and the ability to index via rank-in-group names.
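A sketch of that ring example (illustrative only: TEAM_RANK and TEAM_SIZE follow our CAF 2.0 prototype, ranks are assumed to run from 1 to TEAM_SIZE, and the interpretation of the image numbers in SYNC IMAGES as rank-in-team names is exactly the capability being advocated, not Fortran 2008 semantics):

   integer :: me, np, succ, pred

   np   = team_size(team_default)    ! number of images in the current team
   me   = team_rank(team_default)    ! my position within it (assumed 1..np)
   succ = mod(me, np) + 1            ! my successor in the ring
   pred = mod(me - 2 + np, np) + 1   ! my predecessor in the ring

   sync images ([pred, succ])        ! synchronize with the ring neighbours by
                                     ! rank within the team, not by absolute
                                     ! image number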
1.4. User-defined reductions

We have found that supporting user-defined reductions greatly enhances the expressiveness and utility of collectives in the Rice Coarray Fortran 2.0 prototype, and we recommend that they be supported in the standard. (It is unclear whether this is what Section A.2 in N1858 is intended to represent; in any event, Section A.2 does not correspond to the definition of CO_SUM in Section 4.3.10 of N1858.) In our prototype, we allow the use of Fortran 77 routines as explicit reduction operators.

1.5. Additional collective operations

We argue that the following collective operations are important enough to add to the list of those supported in the standard:
. TEAM_BARRIER
. TEAM_SCAN
. TEAM_BROADCAST
. TEAM_SCATTER, TEAM_GATHER, and TEAM_ALLGATHER
. TEAM_ALLREDUCE
. TEAM_ALLTOALL (personalized)
. TEAM_SHIFT
We advocate the object-oriented TEAM_ prefix for collectives to show that they are team-oriented and to ensure that these routines remain adjacent both in the Fortran standard and in references based on it. In general, it seems prudent to compare the list of supported collectives and their semantics with those provided by MPI, which has seen widespread adoption.

2. SYNC TEAM

We note that if a named TEAM_WORLD variable is defined as we advocate in Section 1.1, SYNC TEAM (when applied to TEAM_WORLD) supplants SYNC ALL. As we note above, we propose using TEAM_BARRIER() instead of SYNC TEAM, as this is the name that a programmer is likely to look for.

3. NUM_IMAGES

This intrinsic should be removed, as it is equivalent to calling TEAM_SIZE(TEAM_WORLD).

4. NOTIFY/QUERY statements

We are deeply concerned that the design presented for NOTIFY and QUERY does not provide a safe synchronization space. If all synchronization is tied directly to the image, there is no way for application code to determine that a NOTIFY was intended for it as opposed to a library routine, and vice versa. In our experience, the resulting confusion can be extremely difficult to debug due to its nondeterminism. For example, if one is attempting to overlap computation with synchronization latency, one cannot safely call a parallel library routine, because the library routine may itself use NOTIFY/QUERY synchronization internally, and there would be no way to distinguish NOTIFYs intended for the user code from those intended for the library code.

We believe instead that synchronization should be performed on first-class EVENT variables. It would also be convenient for the names NOTIFY and QUERY to be replaced with EVENT_NOTIFY and EVENT_QUERY, so that they are directly adjacent within the standard and within programmer references based on the standard.

We have found that it is occasionally useful to QUERY for more than one notification to an event at a time. For example, in a scenario in which each image has an "incoming" event variable, waiting for all four neighbors to signal during the 2D halo exchange of a stencil computation could be a single call to EVENT_QUERY. For this purpose, we recommend adding an optional COUNT argument to EVENT_QUERY (see the sketch at the end of this section). Symmetrically, one might consider adding it to EVENT_NOTIFY, though the case for doing so is less clear.

The language "for each different T in its image set" on line 4 of Section 2.3 is ambiguous (and hard to interpret). We recommend changing it to "for each image T (T != M)" for clarity.
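The halo-exchange sketch referred to above (illustrative only: the EVENT type, EVENT_NOTIFY, EVENT_QUERY and its COUNT argument are the features being proposed here, not existing Fortran syntax, and the grid variables and neighbor indices are placeholders):

   integer, parameter :: n = 100
   integer            :: north, south, east, west   ! neighbor image indices
   type(event)        :: incoming[*]                ! one event per image
   real               :: u(n,n), halo_w(n)[*]       ! field and west halo buffer

   ! push my eastern edge into my east neighbor's west halo buffer,
   ! then signal that neighbor's event; the other three directions
   ! are handled in the same way
   halo_w(:)[east] = u(n,:)
   call event_notify (incoming[east])

   ! wait until all four neighbors have signaled before computing
   call event_query (incoming, count=4)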
5. FORM_TEAM intrinsic procedure

It seems odd to us that team formation is not itself a collective operation, given that each image is going to be executing code as a member of the team anyway. Also, constructing an explicit list of images poses an inherent scalability problem on machines with huge numbers of processors, as will be common in the upcoming exascale era. In discussions with the MPI creators, we have learned that MPI_COMM_SPLIT is almost the only way that users create processor subsets [personal communication]. It is easy to use, and it admits scalable implementations of both the operation and the representation on each processor. In looking at how to adapt MPI for exascale systems, MPI developers have explored scalable implementations of MPI_COMM_SPLIT [2, 3], and we have found that our analogous implementation of TEAM_SPLIT for Coarray Fortran 2.0 [4] scales exceptionally well.

For both scalability and simplicity of use, we advocate TEAM_SPLIT as the mechanism for creating teams in Fortran. Initially, one would apply TEAM_SPLIT to TEAM_WORLD to create subteams. The resulting teams could be further subdivided as application requirements dictate. As with MPI_COMM_SPLIT, TEAM_SPLIT would have each member of the current (parent) team specify a color (which identifies the subteam that will include this image) and a rank (used to compute the relative order of this image within its subteam). If two or more images in a subteam specify the same rank, their order is determined by their rank in the parent team. See the documentation of MPI_COMM_SPLIT in the MPI standard for further details.
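For instance, splitting the images into row teams of a logical process grid might look as follows (a sketch only: TEAM_SPLIT, the team type, and WITH TEAM follow our CAF 2.0 prototype, and the argument order shown, which mirrors MPI_COMM_SPLIT, is illustrative):

   integer, parameter :: ncols = 4               ! illustrative grid width
   type(team)         :: row_team
   integer            :: me, color, rank_in_row

   me          = team_rank(team_world)           ! my position in the parent team
   color       = (me - 1) / ncols                ! equal color means same row team
   rank_in_row = mod(me - 1, ncols) + 1          ! my desired rank within that team

   call team_split (team_world, color, rank_in_row, row_team)

   with team (row_team)
      call team_barrier ()                       ! barrier across my row only
   end with team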
6. TEAM_IMAGES intrinsic procedure

This feature seems to be included solely to support the creation of subteams from an existing team. Judging from how infrequently applications use the MPI_GROUP feature compared with MPI_COMM_SPLIT, we believe that simply providing TEAM_SPLIT instead is the right approach for both simplicity and scalability. Having an individual processor hold a list of all of the images in a huge team is a potential scalability issue.

7. References

[1] A. Skjellum, N. E. Doss, and P. V. Bangalore. Writing libraries in MPI. In A. Skjellum and D. S. Reese, editors, Proceedings of the Scalable Parallel Libraries Conference, pages 166-173. IEEE Computer Society Press, October 1993.
[2] A. Moody, D. H. Ahn, and B. R. de Supinski. Exascale algorithms for generalized MPI_Comm_split. In EuroMPI 2011, 2011.
[3] P. Sack and W. Gropp. A scalable MPI_Comm_split algorithm for exascale computing. In R. Keller, E. Gabriel, M. Resch, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, volume 6305 of Lecture Notes in Computer Science, pages 1-10. Springer Berlin / Heidelberg, 2010. doi:10.1007/978-3-642-15646-5_1.
[4] J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and G. Jin. A new vision for Coarray Fortran. In PGAS '09: Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, pages 1-9, New York, NY, USA, 2009.

Appendix A. Other Important Features

In this section, we discuss features that we view as critically important for coarray-based Fortran applications. Although adding the collection of features listed below would violate the "zero sum" principle that the number of additions must balance the number of deletions, we believe that it is important to add them to the language.

A.1. Atomic Operations

As suggested by Bill Long in N1835, atomic operations should be added to the language to leverage hardware support for direct read-modify-update operations, both locally and across an interconnect. The list of operations to support includes:
. atomic_cas
. atomic_add
. atomic_fadd
. atomic_and
. atomic_fand
. atomic_or
. atomic_for
. atomic_xor
. atomic_fxor
where 'f' indicates a fetch of the old value.

A.2. Predicated COPY_ASYNC

The predicated COPY_ASYNC operation in our CAF 2.0 prototype allows one to explicitly overlap communication of arbitrary amounts of data with computation, and to specify precisely when the copy may begin, when the source data may be overwritten, and when the destination data may be read. Syntactically it is defined as:

   copy_async(var_dest, var_src [, ev_dr] [, ev_cr] [, ev_sr])

where:
   var_dest is a coarray reference that is the target of the copy
   var_src is a coarray reference that is the source of the copy
   ev_dr (aka destination ready) is an optional event indicating that the write to var_dest is complete and var_dest can be read safely
   ev_cr (aka copy ready) is an optional event indicating that the copy can start
   ev_sr (aka source ready) is an optional event indicating that the read of var_src is complete and var_src can be overwritten safely
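A sketch of a typical use, prefetching remote data while other work proceeds (illustrative only: COPY_ASYNC, the event type, and EVENT_QUERY are prototype features rather than standard Fortran, the keyword form ev_dr= is assumed for selecting the optional argument, and do_other_work is a placeholder):

   integer, parameter :: n = 100
   type(event)        :: dr                    ! destination-ready event
   real               :: remote(n)[*], buf(n)[*]
   integer            :: p

   p = 2                                       ! illustrative source image

   ! start copying remote(:) from image p into my local buf;
   ! dr will be notified once buf may safely be read
   call copy_async (buf, remote(:)[p], ev_dr=dr)

   call do_other_work ()                       ! overlap computation with the copy

   call event_query (dr)                       ! block until the prefetched data
                                               ! has arrived; buf can now be used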
A.3. Asynchronous Collectives

Asynchronous versions of collective operations enable one to overlap computation with the communication inherent in effecting the collective. Syntactically, asynchronous versions would have the same format as their synchronous counterparts, but they would include one additional parameter: an event to be notified upon completion of the collective. Calling the asynchronous collective would be a non-blocking operation that initiates the collective and returns; when the results of the collective are needed, a call to EVENT_QUERY would block until completion.

A.4. Copointers

CAF 2.0 adds global pointers to the Fortran language, which support references to remote coarray sections as well as distributed linked data structures. The definition and use of these new "copointers" is intentionally similar to that of ordinary Fortran pointers: they are declared with new attributes analogous to 'pointer' and 'target', manipulated with the existing '=>' pointer assignment statement, and inspected with the existing pointer intrinsics. Accessing data via copointers is similar to existing coarray access, with implicit access to the local image and explicit access to remote images using a square-bracket notation. CAF 2.0's copointers may point to values of any type, including coarrays. Setting up copointers once in the initialization section of a program can lead to dramatically simpler reading and updating of halo regions on neighboring processors. The following code fragment illustrates an example usage of copointers:

   integer, dimension(:), copointer :: p7, p8   ! copointers to array of integer
   integer, dimension(10), cotarget :: a3[*]    ! coarray of array of integer

   p7 => a3          ! copointer to a3's local coarray section
   p8 => a3[9]       ! copointer to a3 on image 9

   p7(6) = 1         ! assigns 6th element of the local section of a3
   p8(6)[] = 42      ! assigns 6th element of the target remote coarray

We note that accessing remote data via copointers remains explicit; this conforms to the spirit of the coarray Fortran extensions in maintaining visual cues that mark remote operations. Further details of copointers can be provided upon request.

.........................................................................

4. Robert Numrich

Robert telephoned me, saying that he would like the extension in the TS to be small. He supports the revised set of collectives proposed by Bill Long in N1835, but without the team argument. He asked for three of his comments in N1835 to be reiterated:

a. The intrinsic function this_image()

The function this_image should allow a scalar return value for coarray arguments with just one codimension:

   integer :: me
   real :: x[*]
   me = this_image(x)

Internally, the function may continue to think it is returning an array of length one, but the programmer should not be penalized for that. Let the value on the left side of the assignment statement be a scalar. At most, issue a warning at compile time. I hit this problem every time I write new code. It is embarrassing trying to explain it to a new coarray programmer.

f. Teams

Remove teams completely from the proposed extensions. [For his rationale, see N1835.]

g. Notify/Query

We should hasten slowly with these statements. The current definition is probably wrong. There probably needs to be some sort of tag associated with these statements, making them look more like events.

.........................................................................

5. Nick Maclaren

NO, with comments. I have not had time to think about this area, but I agree with Reinhold Bader and Robert Numrich, for reasons similar to theirs. I have rechecked the WG5 Garching minutes and, while we did not formally agree to Reinhold's point (2), I recall there being a consensus that it was a necessary step within J3. From my experience of MPI and OpenMP, Robert has pointed out the two most difficult parts of N1858 to specify semantically, unless we have both missed something subtle. We need to go cautiously, and watch out for the semantics. Specifying the syntax is the easy bit.

.........................................................................

6. Tobias Burnus

As a Fortran user and Fortran-compiler developer, I am in favour of a smaller TS. Given that few compilers support coarrays (as defined in Fortran 2008) well, practical experience with the newer features is still rather limited. Thus, adding the most important missing features as a TS allows user experience to be drawn on for additional features during the F201{3,8} development. For me, the collective/broadcast feature is the most important omission; some means of subdivision (teams) or collective I/O seems to be of lesser importance (for my projects at least).