ISO/IEC JTC1/SC22/WG5 N1924

             Requirements for further coarray features

                         John Reid

Here is a draft set of requirements for further coarray features in the 
proposed TS.

1. Overall size

S1. The complexity of the TS on further coarray features should be 
comparable with that of document N1858, from the point of view of
both implementation and edits to the standard. This is the essence of 
Resolution G9 of the Garching meeting (see N1861). 

2. Teams

Teams provide a capability to restrict the image set of remote memory
references, coarray allocations, and synchronizations to a subset of
all the images of the program. This simplifies writing programs that
involve segregated activities (parts of a climate model, for example)
that might more easily be written independently or may have
already been written as independent programs. Teams also provide
a mechanism for subdividing the computation for the sake of better
performance (such as within local SMP domains). Finally, teams provide
the capability to run routines (such as library routines) that have
been developed and tested without any team syntax on a subset of the
images of a program.

The simple team feature described in N1858 is not adequate since the 
cosubscripts map onto the image indices of the set of all images 
rather than those of the team. When code that has been developed 
without teams is run on several teams, the cosubscripts will need to 
be changed for each team. We are led to this team requirement:

T1: When a subprogram without any team syntax is called on
    images executing as a team, it should execute on those images as
    if the program contained no other images. This has the following
    implications:
    1. Image indices shall be relative to the team, starting at 1 and
       ending with the number of images in the team.
    2. Collective activities, such as SYNC ALL, allocation and
       deallocation of coarrays, collective subroutine execution, and
       inquiry intrinsics such as THIS_IMAGE and NUM_IMAGES shall be
       relative to the team.
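
As an illustration of implication 1 only, here is a minimal Python
sketch that models a team as an ordered list of image indices taken
from the set of all images. The class Team and its method names are
hypothetical, chosen for this sketch, and are not proposed syntax.

```python
class Team:
    """Hypothetical model of a team: an ordered subset of all images."""

    def __init__(self, global_images):
        # global_images[k] is the index, among all images of the program,
        # of the image whose team-relative index is k + 1.
        self.global_images = list(global_images)

    def num_images(self):
        # Analogue of NUM_IMAGES evaluated inside the team.
        return len(self.global_images)

    def this_image(self, global_index):
        # Analogue of THIS_IMAGE: team-relative index of an image.
        return self.global_images.index(global_index) + 1

    def to_global(self, team_index):
        # Map a team-relative image index back to the set of all images.
        if not 1 <= team_index <= self.num_images():
            raise ValueError("image index outside the current team")
        return self.global_images[team_index - 1]


# Images 5..8 of an 8-image program execute as a team.
team = Team([5, 6, 7, 8])
print(team.num_images())    # 4
print(team.this_image(7))   # 3
print(team.to_global(1))    # 5
```

Code written without team syntax would see only the indices 1 to 4;
the mapping to images 5 to 8 would be the business of the processor.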

I think it is generally agreed that any team will have been formed as a 
part of a larger "parent" team. We need to consider whether there is a 
requirement for access to other variables of the parent. This could be 
done by naming each team and extending the coarray syntax in some way, 
such as
      array[i,j]@team
where the cosubscript mapping is that of the parent team. If we go this 
way, we would need to extend synchronization features, too. This was not 
part of the 12-136r2 proposal, which has these bullet points:

- An image is always a member of some team, and a member of only one
  team at a time.
- Access to variables on images outside the current team is not
  permitted.

This leads to these alternative team requirements:

T2a: While an image executes a statement it shall be a member of one
    and only one team. Access to variables on images outside the
    current team is not permitted.

T2b: While an image executes a statement it shall be a full member of
    one and only one team, but also have access through new syntax to
    the data and synchronization features of its ancestor teams.

For the formation of teams, 12-136r2 has this bullet point:

- New teams can be formed with a new statement (SPLIT TEAM, or FORM
  TEAM) that defines the specified teams. The aggregate number of
  images in the teams shall equal the number of images in the current
  team. The new teams are composed of images with consecutive image
  numbers in the current team. A team variable cannot be defined other
  than by execution of the statement used to form teams.

The requirement "The new teams are composed of images with consecutive 
image numbers in the current team." seems too restrictive to me. It 
would be awkward for 2- and 3-d applications. I suggest this team 
requirement:

T3: It should be possible to split a team into mutually exclusive
    subsets that are themselves teams. This should be dynamic in
    order to allow different groupings of images during different
    stages of execution. It is desirable to have a
    compact mechanism for specifying a large number of teams each of
    the same number of images. Further mechanisms are not needed.
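
To show why consecutive image numbers are too restrictive for a 2-d
application, here is a small Python sketch of one possible splitting
rule. The function split_by_color and the idea of a "color" per image
are assumptions borrowed from MPI practice, not part of any proposal
here; the 3 x 4 grid and its column-major numbering are also assumed.

```python
def split_by_color(num_images, color_of):
    """Partition images 1..num_images into teams keyed by color.

    color_of is a hypothetical user-supplied function; images that return
    the same color join the same team, in increasing image order.
    """
    teams = {}
    for image in range(1, num_images + 1):
        teams.setdefault(color_of(image), []).append(image)
    return teams


# 12 images viewed as a 3 x 4 grid numbered in column-major order,
# as Fortran cosubscripts would map them.
rows, cols = 3, 4
row_teams = split_by_color(rows * cols, lambda image: (image - 1) % rows + 1)
col_teams = split_by_color(rows * cols, lambda image: (image - 1) // rows + 1)

print(row_teams[1])   # [1, 4, 7, 10]
print(col_teams[1])   # [1, 2, 3]
```

The column teams happen to be consecutive, but the row teams are not.
One rule yields many teams of equal size, which is the compact
mechanism that T3 asks for.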

For changing teams, 12-136r2 has this bullet point:

- A construct is provided to specify a new current team for the
  executing image. Possibilities are a WITH TEAM ... END WITH TEAM
  construct, or a SELECT TEAM ... END SELECT construct. ....

which is rather too explicit for a requirement. How's this?

T4: There shall be a mechanism for changing the current team, involving
    the synchronization of all members of the old team and the
    synchronization of all members of the new team. It should be via a
    construct so that it is apparent (both to the system and the
    programmer) where team execution begins and ends.

To identify teams, how about

T5: There shall be a type for variables identifying teams (probably
    a derived type defined in the intrinsic module ISO_FORTRAN_ENV).

We also need:

T6: In order for the feature to coexist with MPI, there needs to be
    a mechanism to find the image index relative to the set of all
    images of the program.

This might best be done by adding an optional argument to IMAGE_INDEX
that specifies the team.
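
The translation that T6 asks for can be pictured as following the
chain of parent teams outwards. The Python sketch below assumes that
each team records the parent-team index of each of its members; this
representation and the function name are hypothetical.

```python
def index_in_all_images(team_index, ancestry):
    """Translate a team-relative image index to an index among all images.

    ancestry is a list of lists, innermost team first; ancestry[k][i - 1]
    is the index in the parent team of image i of team k (an assumed
    representation, chosen only for this sketch).
    """
    index = team_index
    for members_in_parent in ancestry:
        index = members_in_parent[index - 1]
    return index


# 8 images are split into two teams of 4; the second team, images 5..8,
# is split again and one of its subteams holds its members 2 and 4.
second_half = [5, 6, 7, 8]      # indices in the set of all images
even_members = [2, 4]           # indices in the team second_half
print(index_in_all_images(2, [even_members, second_half]))  # 8
```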

There is an issue with allocatable arrays that are allocated within a 
team. Can these be supported in symmetric memory or are they like 
allocatable components of a coarray? They can be supported in memory 
that is symmetric for the team provided they are all deallocated when 
execution reverts to the parent team. So we have two alternative
requirements:

T7a. It should be possible to support allocatable arrays that are 
allocated within a team in memory that is symmetric for the team.

T7b. It need not be possible to support allocatable arrays that are 
allocated within a team in memory that is symmetric for the team.

3. Collectives

C1. A collective subroutine is an intrinsic subroutine that is executed
by a set of images; it has internal synchronization and performs a
computation based on values on the images of the set. Collective
subroutines offer the possibility of substantially more
efficient execution of reduction operations than non-expert
programmers could achieve. Corresponding routines are widely used in MPI
programs. For flexibility, there should be a subroutine based on a
user-written procedure that applies the required operation to local
variables. In addition, because they are often needed, there
should be specific collective subroutines for SUM, MAX, and MIN.
Forms that provide the result to just one image or to all the images
involved should be provided. Beyond this, there should be a collective
subroutine that broadcasts a value on one image to a set of images. 
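
The intended semantics can be modelled in a few lines of Python. This
is a sketch only: the names co_reduce and co_broadcast, the argument
result_image, and the use of a list to stand for the values held on
the images are all assumptions made for illustration.

```python
from functools import reduce


def co_reduce(values, operation, result_image=None):
    """Model a collective reduction over the images of a set.

    values[i - 1] is the contribution of image i. The return value lists
    what each image holds afterwards: the reduced value on every image
    when result_image is None, otherwise only on result_image, with the
    other images keeping their original values.
    """
    reduced = reduce(operation, values)
    if result_image is None:
        return [reduced] * len(values)
    after = list(values)
    after[result_image - 1] = reduced
    return after


def co_broadcast(values, source_image):
    """Model a broadcast of the value on source_image to all the images."""
    return [values[source_image - 1]] * len(values)


contributions = [3, 1, 4, 1]
print(co_reduce(contributions, lambda a, b: a + b))       # [9, 9, 9, 9]
print(co_reduce(contributions, max, result_image=2))      # [3, 4, 4, 1]
print(co_broadcast(contributions, source_image=3))        # [4, 4, 4, 4]
```

The sketch combines the values from left to right. A real
implementation would presumably be free to combine them in another
order, for example along a tree, which is where the efficiency gain
would come from.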


4. Additional intrinsic atomic subroutines

A1. Atomic memory operations provide powerful tools for synchronizing
the execution of activities among images without the use of heavy-weight
sync and lock statements. They can provide substantial performance
advantages. The minimal set needed consists of compare-and-swap,
fetch-and-atomic-integer-add, and atomic-bitwise-and-xor. Since they
offer convenience and clearer functionality, the atomic and, or, and 
xor bitwise operations, and a simple swap operation should be included. 
For the integer add and bitwise logical operations, both the direct and 
"fetch-and" versions should be supplied.

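For readers less familiar with these operations, the Python sketch
below models their semantics on a single integer cell. The class
AtomicInt and its method names are hypothetical, and a lock stands in
for the indivisibility that hardware or the runtime would provide.

```python
import threading


class AtomicInt:
    """Hypothetical integer cell whose updates are indivisible."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, compare, new):
        # Store new only if the cell equals compare; return the old value.
        with self._lock:
            old = self._value
            if old == compare:
                self._value = new
            return old

    def fetch_add(self, increment):
        # Add increment and return the value held before the addition.
        with self._lock:
            old = self._value
            self._value = old + increment
            return old

    def fetch_xor(self, mask):
        # Bitwise exclusive-or with mask; return the previous value.
        with self._lock:
            old = self._value
            self._value = old ^ mask
            return old

    def load(self):
        with self._lock:
            return self._value


counter = AtomicInt()


def worker():
    for _ in range(1000):
        counter.fetch_add(1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.load())                     # 4000
print(counter.compare_and_swap(0, 7))     # 4000, cell unchanged
print(counter.compare_and_swap(4000, 7))  # 4000, cell now 7
print(counter.fetch_xor(5))               # 7, cell now 2
```

A "direct" version of an operation would perform the same update but
discard the old value instead of returning it.
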
5. Synchronization using events

E1. There should be a mechanism to allow one-sided ordering of
segments. For example, suppose image I executes successive segments
I1 and I2 and image J executes successive segments J1 and J2; there
might be a need for I1 to precede J2 without the need for J1 to
precede I2.

The NOTIFY and QUERY statements were proposed for Fortran 2008, but
for matching the execution of a NOTIFY statement on one image with
the execution of a QUERY statement on another image, the feature
relied on the numbers of times the statements were executed on the
images. This mechanism is not robust in the presence of segment
reordering; for example, an image that would otherwise be idle might
bring other work forward. The mechanism should be replaced by one
that relies on the use of a data item (tag), accessible on all the 
images (or all the images involved if teams are adopted), to identify 
the event. The tagging aspect is important for employing this capability 
in a library routine in a way that is hidden from, and does not 
interfere with, the caller.
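
The one-sided ordering of E1, with the event identified by a tag
rather than by execution counts, can be sketched with Python threads
standing in for images. The tag names and the dictionary of events
are hypothetical.

```python
import threading

# One event object per tag; the tag names are hypothetical.
events = {
    "halo_ready": threading.Event(),
    "library_internal": threading.Event(),
}
log = []


def image_i():
    log.append("I1")                 # segment I1
    events["halo_ready"].set()       # post the tagged event
    log.append("I2")                 # segment I2 starts without waiting for J


def image_j():
    # Segment J1 is left empty here; in general it could run
    # concurrently with I1 and I2.
    events["halo_ready"].wait()      # wait for that tag only
    log.append("J2")                 # segment J2


j = threading.Thread(target=image_j)
i = threading.Thread(target=image_i)
j.start()
i.start()
i.join()
j.join()

print(log.index("I1") < log.index("J2"))   # True
```

I1 always precedes J2, while the order of I2 and J2 is left open.
A library routine could use its own tag, such as library_internal
above, without disturbing events posted by the caller.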

Bill Long comments:
Separately from this feature, but related, is a need/desire for a 
"put-with-notify" feature that is more fine-grained.  It conceptually 
falls somewhere between full segment ordering (as with this event 
feature), and atomic operations.  The specific interest is in a 
facility that transfers data from one image to another (the "put") 
followed by setting a flag on the target image (the "notify").  The 
target image can be assured that the data has arrived when the flag 
variable is seen to be set.  On the sending side, the "put" is initiated 
but execution progresses without waiting for it to finish, assuming no 
other dependencies.  It is potentially much faster than a put - sync 
memory - post event sequence because the only memory sync is on just 
that variable (not all memory activity initiated by the image, hence not 
segment ordering), and the restricted sync happens behind the scenes so 
the sending image does not have to wait to execute the "post event" part 
of the operation.  This allows for better overlap of computation with 
communication, which is good for performance.  The receiving image can 
detect that the data has not yet arrived and possibly do other work 
rather than be forced to wait at a particular point.  [Note that this 
shares some of the goals (and, I assume, motivation) of the copy_async 
feature of the RICE compiler.]
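
The behaviour described in this comment can be pictured with a short
Python sketch. It is an illustration under assumptions: the helper
put_with_notify is hypothetical, a thread stands in for the transfer
running behind the scenes, and the main thread plays the sending
image first and the target image afterwards.

```python
import threading

buffer = [None]                  # data on the target image
arrived = threading.Event()      # the flag on the target image


def put_with_notify(value):
    """Start the transfer and return at once (hypothetical helper)."""
    def transfer():
        buffer[0] = value        # the "put"
        arrived.set()            # the "notify", after the data is in place
    worker = threading.Thread(target=transfer)
    worker.start()
    return worker


other_work = 0
transfer_thread = put_with_notify(42)      # sending image continues at once
while not arrived.is_set():                # target image polls the flag
    other_work += 1                        # and does other work meanwhile
received = buffer[0]                       # safe: the flag has been seen set
transfer_thread.join()

print(received)   # 42
```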


6. Parallel input-output

Here are two candidate IO requirements:

IO1. Fortran 2008 does not permit a file to be open on more than
one image at the same time. This restriction should be lifted for
direct-access files and suitable facilities defined. If teams are
adopted, it should be possible to specify that a file is open on
all images of a team.

IO2. In Fortran 2008, whether a file with a given name is the same
file on all images or varies from one image to another is
processor dependent. A mechanism should be added to allow the
programmer to specify which of these is intended.

Malcolm Cohen has commented on IO1 as follows:

"suitable" is an interesting word.  It might be necessary or desirable 
for there to be some kind of restriction on the record size of such a 
file.  There is also an interesting interaction with IO2.  About which 
the less said the better.

I presume that "restriction should be lifted for direct-access files" 
is missing words like "when specific syntax indicating the intent to do 
a shared open appears in the OPEN statement".  Otherwise this proposal 
is going to be a very hard sell.

Obvious other questions are:
(a) whether the record position is shared or per-image; I suggest per-image.
(b) whether reading/writing (or writing/writing) are permitted in unordered 
segments or whether synchronisation is required; I would suggest 
synchronisation should be required.
(c) whether FLUSH is required to make the effect of any WRITE necessarily 
visible to a subsequent READ on another image; I would think that this is 
needed, since otherwise you need to flush all the shared-open files at every segment 
boundary.
(d) whether ASYNCHRONOUS i/o is permitted on a shared file, and if so how it 
interacts between images, and whether the pending data transfer operation ids 
are shared; allowing ASYNC looks pretty big to me - hard enough to get the 
non-async cases right.
(e) which OPEN specifiers are permitted to be different, and which ones are 
required to be the same on all participating images.
(f) whether the processor is required to detect a sharing failure (for example, 
one machine in a cluster might have had its link to the fileserver unmounted by 
an overzealous sysadmin) or whether we're just going to punt on that and say 
"good luck".
(g) whether the unit number is required to have the same value on each image or 
whether some other mechanism is used to match up files that are supposed to be 
shared between images.
(h) whether NEWUNIT= is usable, and whether it is required to return the same 
value on different images (I would suggest Yes and No respectively for 
hopefully-obvious reasons).
(i) whether the FILE=name is required to have the same value on each image, for 
example what if the fileserver mount is identified by the environment variable 
FILESERVER but the actual mount point might differ between machines in a 
cluster.
(j) whether attempting to share-open a file is an implicit synchronization of 
all participating images, and whether the OPEN statement is required to be the 
same one; I would suggest Yes on both counts.

Malcolm Cohen has commented on IO2 as follows:

Even in the degenerate case where all the images are actually running on the 
same physical processor with the same operating system and file system, it is 
going to be operating system dependent whether a file with a given name on one 
image is going to be the same as a file with that name a nanosecond later on a 
different image or even on the same image.  In general, one cannot make any 
claim about this.  These difficulties are greatly magnified on clusters where 
individual processors are running individual operating systems, even if they are 
all copies of the same O.S. in theory, as the "remote mounts" might be organised 
differently.  Not to mention that even on a single computer, some device files 
such as /dev/tty could well be different no matter how much the programmer 
"specifies" he wants them to be the same, and on a cluster there is no question 
of them being the same.

Operating systems also sometimes make it infeasible to find out whether two 
files on different computers are actually the same file on some other computer, 
other than by writing a cryptographically secure code into the file on one 
computer and trying to read it back out on the other (even then we are only 
probabilistically certain).

Furthermore, if the two computers have got the same file systems mounted at the 
relevant points, the programmer saying he wants "//fileserver/share/fred.dat" to 
be different files might well be out of luck - no matter how much he wants them 
to be different, if the operating system says they are the same that is going to 
be the end of the story.

In fact I see no possibility of what the programmer specifies he wants having 
any actual effect on what he gets at all; if the systems have the same file name 
space, they are going to be the same file, and if they don't then they won't, 
and all the edge cases (device, other special files) are going to be similarly 
ineffective.  Unless you are asking the Fortran processor itself to provide a 
distributed file system with set semantics... and we have seen in mainframe days 
just how popular having a file system that differs from the operating system's 
view of the file system was (viz massively unpopular).

"Obviously" if the user does a shared-open (IO1) of the file he wants it to be 
the same file (whether it's the same name or not, see previous item) and if he 
does the normal Fortran OPEN he wants it to be a different file (since he's not 
allowed to have the same one open across multiple images).  Other than this, I 
cannot see a plausible use for the suggestion.