ISO/IEC JTC1/SC22/WG5 N1924

Requirements for further coarray features

John Reid

Here is a draft set of requirements for further coarray features in the proposed TS.

1. Overall size

S1. The complexity of the TS on further coarray features should be comparable with that of document N1858, from the point of view of both implementation and edits to the standard. This is the essence of Resolution G9 of the Garching meeting (see N1861).

2. Teams

Teams provide a capability to restrict the image set of remote memory references, coarray allocations, and synchronizations to a subset of all the images of the program. This simplifies writing programs that involve segregated activities (parts of a climate model, for example) that might more easily be written independently or may have already been written as independent programs. Teams also provide a mechanism for subdividing the computation for the sake of better performance (such as within local SMP domains). Finally, teams provide the capability to run routines (such as library routines) that have been developed and tested without any team syntax on a subset of the images of a program.

The simple team feature described in N1858 is not adequate since the cosubscripts map onto the image indices of the set of all images rather than those of the team. When code that has been developed without teams is run on several teams, the cosubscripts will need to be changed for each team. We are led to this team requirement:

T1: When a subprogram without any team syntax is called on images executing as a team, it should execute on those images as if the program contained no other images.

This has the following implications:
   1. Image indices shall be relative to the team, starting at 1 and ending with the number of images in the team.
   2. Collective activities, such as SYNC ALL, allocation and deallocation of coarrays, collective subroutine execution, and inquiry intrinsics such as THIS_IMAGE and NUM_IMAGES, shall be relative to the team.
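
The first implication of T1 can be illustrated with a small model. The sketch below is Python, not proposed Fortran syntax; the Team class, its method names, and the two example teams are hypothetical and serve only to show how team-relative image indices, running from 1 to the number of images in the team, might relate to the image indices of the set of all images.

```python
class Team:
    """Toy model of a team: an ordered subset of the program's images.

    Hypothetical illustration only; it is not syntax proposed for the TS.
    """

    def __init__(self, global_indices):
        # Global image indices (1-based) of the members, in team order.
        self.members = list(global_indices)
        if len(set(self.members)) != len(self.members):
            raise ValueError("an image may appear only once in a team")

    def num_images(self):
        # Analogue of NUM_IMAGES evaluated inside the team.
        return len(self.members)

    def this_image(self, global_index):
        # Analogue of THIS_IMAGE: team-relative index, starting at 1.
        return self.members.index(global_index) + 1

    def global_index(self, team_index):
        # Map a team-relative image index back to the set of all images.
        return self.members[team_index - 1]


# Eight images split into two teams of four; code running inside either
# team sees image indices 1..4, as if the program had no other images.
ocean = Team([1, 2, 3, 4])
atmosphere = Team([5, 6, 7, 8])
```

In this model, a reference to image 2 made by code running in the second team reaches global image 6, so a subprogram written without team syntax would need no change to its cosubscripts.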
I think it is generally agreed that any team will have been formed as a part of a larger "parent" team. We need to consider whether there is a requirement for access to other variables of the parent. This could be done by naming each team and extending the coarray syntax in some way, such as array[i,j]@team where the cosubscript mapping is that of the parent team. If we go this way, we would need to extend synchronization features, too. This was not part of the 12-136r2 proposal, which has these bullet points:
- An image is always a member of some team, and a member of only one team at a time.
- Access to variables on images outside the current team is not permitted.

This leads to these alternative team requirements:

T2a: While an image executes a statement it shall be a member of one and only one team. Access to variables on images outside the current team is not permitted.

T2b: While an image executes a statement it shall be a full member of one and only one team, but also have access through new syntax to the data and synchronization features of its ancestor teams.

For the formation of teams, 12-136r2 has this bullet point:
- New teams can be formed with a new statement (SPLIT TEAM, or FORM TEAM) that defines the specified teams. The aggregate number of images in the teams shall equal the number of images in the current team. The new teams are composed of images with consecutive image numbers in the current team. A team variable cannot be defined other than by execution of the statement used to form teams.

The requirement "The new teams are composed of images with consecutive image numbers in the current team." seems too restrictive to me. It would be awkward for 2- and 3-d applications. I suggest this team requirement:

T3: It should be possible to split a team into mutually exclusive subsets that are themselves teams. This should be dynamic in order to allow different groupings of images during different stages of execution.
It is desirable to have a compact mechanism for specifying a large number of teams each of the same number of images. Further mechanisms are not needed.

For changing teams, 12-136r2 has this bullet point:
- A construct is provided to specify a new current team for the executing image. Possibilities are a WITH TEAM ... END WITH TEAM construct, or a SELECT TEAM ... END SELECT construct.

... which is rather too explicit for a requirement. How's this?

T4: There shall be a mechanism for changing the current team, involving the synchronization of all members of the old team and the synchronization of all members of the new team. It should be via a construct so that it is apparent (both to the system and the programmer) where team execution begins and ends.

To identify teams, how about

T5: There shall be a type for variables identifying teams (probably a derived type defined in the intrinsic module ISO_FORTRAN_ENV).

We also need:

T6: In order for the feature to coexist with MPI, there needs to be a mechanism to find the image index relative to the set of all images of the program. This might best be done by adding an optional argument to IMAGE_INDEX that specifies the team.

There is an issue with allocatable arrays that are allocated within a team. Can these be supported in symmetric memory or are they like allocatable components of a coarray? They can be supported in memory that is symmetric for the team provided they are all deallocated when execution reverts to the parent team. So we have two alternative requirements:

T7a. It should be possible to support allocatable arrays that are allocated within a team in memory that is symmetric for the team.

T7b. It need not be possible to support allocatable arrays that are allocated within a team in memory that is symmetric for the team.

3. Collectives

C1.
A collective subroutine is an intrinsic subroutine that is executed by a set of images; it has internal synchronization and performs a computation based on values on the images of the set. Collective subroutines offer the possibility of substantially more efficient execution of reduction operations than non-expert programmers could achieve by hand. Corresponding routines are widely used in MPI programs. For flexibility, there should be a subroutine based on a user-written procedure that applies the required operation to local variables. In addition, because they are often needed, there should be specific collective subroutines for SUM, MAX, and MIN. Forms that provide the result to just one image or to all the images involved should be provided. Beyond this, there should be a collective subroutine that broadcasts a value on one image to a set of images.

4. Additional intrinsic atomic subroutines

A1. Atomic memory operations provide powerful tools for synchronizing the execution of activities among images without use of heavy-weight sync and lock statements. They can provide substantial performance advantages. The minimal set needed comprises compare-and-swap, fetch-and-atomic-integer-add, and atomic-bitwise-and-xor. Since they offer convenience and clearer functionality, the atomic and, or, and xor bitwise operations, and a simple swap operation should be included. For the integer add and bitwise logical operations, both the direct and "fetch-and" versions should be supplied.

5. Synchronization using events

E1. There should be a mechanism to allow one-sided ordering of segments. For example, suppose image I executes successive segments I1 and I2 and image J executes successive segments J1 and J2; there might be a need for I1 to precede J2 without the need for J1 to precede I2.
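
The one-sided ordering in the E1 example can be modelled as follows. This is a sketch in Python rather than Fortran: two threads stand in for images I and J, a threading.Event stands in for the event, and the names log, record, i1_done, image_i and image_j are hypothetical. It assumes nothing about the syntax that the TS might eventually adopt.

```python
import threading

log = []                      # records the order in which segments run
log_lock = threading.Lock()
i1_done = threading.Event()   # stands in for the event posted by image I


def record(segment):
    # Append a segment name to the shared log under a lock.
    with log_lock:
        log.append(segment)


def image_i():
    record("I1")
    i1_done.set()             # post: I1 is complete; image I does not wait
    record("I2")


def image_j():
    record("J1")
    i1_done.wait()            # wait: J2 may not start until I1 has finished
    record("J2")


# Start J first to show that the ordering does not depend on start order.
threads = [threading.Thread(target=image_j), threading.Thread(target=image_i)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Only image J waits. Image I proceeds from I1 to I2 without regard to J1, so I1 always precedes J2 while the relative order of J1 and I2 is left unconstrained.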
The NOTIFY and QUERY statements were proposed for Fortran 2008, but for matching the execution of a NOTIFY statement on one image with the execution of a QUERY statement on another image, the feature relied on the numbers of times the statements were executed on the images. This mechanism is not robust in the presence of segment reordering; for example, an image that would otherwise be idle might bring other work forward. The mechanism should be replaced by one that relies on the use of a data item (tag), accessible on all the images (or all the images involved if teams are adopted), to identify the event. The tagging aspect is important for employing this capability in a library routine in a way that is hidden from, and does not interfere with, the caller.

Bill Long comments:

Separately from this feature, but related, is a need/desire for a "put-with-notify" feature that is more fine-grained. It conceptually falls somewhere between full segment ordering (as with this event feature), and atomic operations. The specific interest is in a facility that transfers data from one image to another (the "put") followed by setting a flag on the target image (the "notify"). The target image can be assured that the data has arrived when the flag variable is seen to be set. On the sending side, the "put" is initiated but execution progresses without waiting for it to finish, assuming no other dependencies. It is potentially much faster than a put - sync memory - post event sequence because the only memory sync is on just that variable (not all memory activity initiated by the image, hence not segment ordering), and the restricted sync happens behind the scenes so the sending image does not have to wait to execute the "post event" part of the operation. This allows for better overlap of computation with communication, which is good for performance.
The receiving image can detect that the data has not yet arrived and possibly do other work rather than be forced to wait at a particular point. [Note that this shares some of the goals (and, I assume, motivation) of the copy_async feature of the RICE compiler.]

6. Parallel input-output

Here are two candidate IO requirements:

IO1. Fortran 2008 does not permit a file to be open on more than one image at the same time. This restriction should be lifted for direct-access files and suitable facilities defined. If teams are adopted, it should be possible to specify that a file is open on all images of a team.

IO2. In Fortran 2008, whether a file with a given name is the same file on all images or varies from one image to another is processor dependent. A mechanism should be added to allow the programmer to specify which of these is intended.

Malcolm Cohen has commented on IO1 as follows:

"suitable" is an interesting word. It might be necessary or desirable for there to be some kind of restriction on the record size of such a file. There is also an interesting interaction with IO2. About which the less said the better.

I presume that "restriction should be lifted for direct-access files" is missing words like "when specific syntax indicating the intent to do a shared open appears in the OPEN statement". Otherwise this proposal is going to be a very hard sell.

Obvious other questions are:
(a) whether the record position is shared or per-image; I suggest per-image.
(b) whether reading/writing (or writing/writing) are permitted in unordered segments or whether synchronisation is required; I would suggest synchronisation should be required.
(c) whether FLUSH is required to make the effect of any WRITE necessarily visible to a subsequent READ on another image; I would think that this is needed otherwise you need to flush all the shared-open files at every segment boundary.
(d) whether ASYNCHRONOUS i/o is permitted on a shared file, and if so how it interacts between images, and whether the pending data transfer operation ids are shared; allowing ASYNC looks pretty big to me - hard enough to get the non-async cases right.
(e) which OPEN specifiers are permitted to be different, and which ones are required to be the same on all participating images.
(f) whether the processor is required to detect a sharing failure (for example, one machine in a cluster might have had its link to the fileserver unmounted by an overzealous sysadmin) or whether we're just going to punt on that and say "good luck".
(g) whether the unit number is required to have the same value on each image or whether some other mechanism is used to match up files that are supposed to be shared between images.
(h) whether NEWUNIT= is usable, and whether it is required to return the same value on different images (I would suggest Yes and No respectively for hopefully-obvious reasons).
(i) whether the FILE=name is required to have the same value on each image, for example what if the fileserver mount is identified by the environment variable FILESERVER but the actual mount point might differ between machines in a cluster.
(j) whether attempting to share-open a file is an implicit synchronization of all participating images, and whether the OPEN statement is required to be the same one; I would suggest Yes on both counts.

Malcolm Cohen has commented on IO2 as follows:

Even in the degenerate case where all the images are actually running on the same physical processor with the same operating system and file system, it is going to be operating system dependent whether a file with a given name on one image is going to be the same as a file with that name a nanosecond later on a different image or even on the same image. In general, one cannot make any claim about this.
These difficulties are greatly magnified on clusters where individual processors are running individual operating systems, even if they are all copies of the same O.S. in theory, as the "remote mounts" might be organised differently. Not to mention that even on a single computer, some device files such as /dev/tty could well be different no matter how much the programmer "specifies" he wants them to be the same, and on a cluster there is no question of them being the same.

Operating systems also sometimes make it infeasible to find out whether two files on different computers are actually the same file on some other computer, other than by writing a cryptographically secure code into the file on one computer and trying to read it back out on the other (even then we are only probabilistically certain).

Furthermore, if the two computers have got the same file systems mounted at the relevant points, the programmer saying he wants "//fileserver/share/fred.dat" to be different files might well be out of luck - no matter how much he wants them to be different, if the operating system says they are the same that is going to be the end of the story.

In fact I see no possibility of what the programmer specifies he wants having any actual effect on what he gets at all; if the systems have the same file name space, they are going to be the same file, and if they don't then they won't, and all the edge cases (device, other special files) are going to be similarly ineffective. Unless you are asking the Fortran processor itself to provide a distributed file system with set semantics... and we have seen in mainframe days just how popular having a file system that differs from the operating system's view of the file system was (viz. massively unpopular).

"Obviously" if the user does a shared-open (IO1) of the file he wants it to be the same file (whether it's the same name or not, see previous item) and if he does the normal Fortran OPEN he wants it to be a different file (since he's not allowed to have the same one open across multiple images). Other than this, I cannot see a plausible use for the suggestion.