ISO/IEC JTC1/SC22/WG5 N1883

Comments on the contents of the TS on further coarray features

John Reid
30 September 2011

In N1868, I invited comments on the technical content of the TS on further coarray features, so that PL22.3 (J3) can begin to construct a draft requirements document during its meeting in October (10-14 October). I asked the question "Is the technical content of N1858 suitable for the TS on further coarray features?" and requested that it be answered in one of these ways:
1) Yes.
2) Yes, but I recommend the following changes (please do not ask only for additions).
Here are the replies.

1. Reinhold Bader

I have talked with Uwe Kuester (HLRS), who has experience with the Cray implementation, as well as with Tobias Burnus; both share my serious doubts that the technical content of N1858 is suitable (without a serious redesign effort), which is why I would like to register a NO answer.

My answer is: NO, with comments. [I am aware that this is not one of the alternatives I was allowed to choose among, but after much consideration I still think this is the appropriate answer.]

Reasons for the NO vote, and comments:

(1) WG5/N1858 Coarray TS draft (Long) was extracted from a Fortran 2008 draft standard, among other reasons, because its technical content was considered at least partially controversial. In its present form, too many issues and open questions remain, as is indicated by the fact that many additional suggestions were made in WG5/N1835 Requirements for TR of further coarray features (Reid), as well as in WG5/N1856 Addition/Modification of CAF Features (authors from Rice University). These also reflect new knowledge obtained from a number of years of research, which should be taken into account when re-designing the basic ideas for the coarray extensions.

(2) While I agree that the workload on J3 may be an issue, and that the number and complexity of the features dealt with by the TS should be limited, based on the observations in (1) I think it is unrealistic to expect that the minimal reasonable feature set will be only as big and complex as the one defined in N1858. My opinion is that the correct way to deal with this situation is to
(a) set up, in a manner analogous to what WG5/N1820 C Interoperability Objectives (Maclaren/Long) did for the interop TR, a document that describes the objectives, including features, requirements, constraints and excluded features, for the coarray TS. It would be nice if the feature items on this list could be individually voted on at the WG5 level.
(b) estimate the additional amount of work this would require of J3. If this takes longer than originally envisioned, that is still a better situation than rapidly churning out a badly designed coarray extension.
In my opinion the sweet spot is a feature set that is approximately 30% bigger in complexity and J3 workload than the one implied by N1858, but that provides a significantly greater enhancement of programming productivity and performance scalability to coarray Fortran users. The price to pay (probably 2-3 additional J3 meetings compared with the present schedule in WG5/N1859 Strategic plans for WG5 (Reid)) seems adequate; an overlap of at most one year with the start of the work on the next Fortran standard also seems acceptable.
(3) I also do not consider it a good idea to freeze part of the features before all others, at least not unless the process suggested in (2a) allows one to determine that a particular feature decided there does not interact in any relevant way with a feature targeted for "early release" (which is unlikely). This is also an argument against attempting to split off parts of the coarray TS contents to be treated separately in future Fortran extensions.

.........................................................................

2. Uwe Kuester (kuester@hlrs.de)

Coarray Fortran enables the programmer to formulate one-sided communication in a simple and intuitive way via the codimension syntax. Why is this important? In a modern computer we see latencies of various kinds for data access: memory and cache latencies, and much larger latencies in the interconnection network. Latencies hinder good performance because they limit the bandwidth that is actually achievable for small messages. In a given architecture latencies cannot be reduced. They can be avoided by concatenating many small data items into a single stream, with the latency paid only once at the beginning of the stream, or they can be hidden behind other useful operations.

When a consumer requests data from a remote source, the latency typically appears twice: once for sending the request (the remote address) and once for transferring the data back. The advantage is that the consumer can use the data directly after arrival. The statement a = b[remote_proc] allows the immediate use of a after this statement, but unless the compiler can move the fetch of b[remote_proc] to an earlier point in time, we have to wait a long latency time, even assuming that b is already defined in the remote memory. Communicating in the opposite direction, a[target_proc] = b, implies almost no latency for the image on which b resides. The image target_proc pays instead with uncertainty about when the data will arrive. A synchronizing statement sync images([target_proc,remote_proc]) ensures that a can be used on target_proc, but it costs more than twice the latency.

A well formulated and well programmed parallel algorithm should contain as few synchronization points as possible to ensure high performance for a large number of active images. The flow of information should go in one direction only, in order to decouple sender and receiver. This removes some latencies and allows pipelining. If the order of the information transfers is not changed, a special trailer at the end of the transmitted data can inform the target processor of the successful arrival of the data. The consuming processor may wait for the data or may do other work in the meantime. That is the purpose of NOTIFY --> QUERY pairs. The sending processor informs via NOTIFY that it has initiated the transmission and has transferred the data to the transmitting hardware. The image target_proc recognizes the message as the trailer of the data. Without the NOTIFY --> QUERY mechanism the one-sided communication capabilities of Coarray Fortran are not complete: unwanted synchronization via "sync images" or "sync all" is then needed.
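As an illustration of this pattern (a minimal sketch only; it uses the NOTIFY/QUERY syntax of N1858, and the image indices sender and consumer and the value assigned are purely illustrative):

   real    :: a[*], b
   integer :: sender, consumer

   sender = 1; consumer = 2     ! illustrative image numbers
   if (this_image() == sender) then
      b = 42.0                  ! produce the data locally
      a[consumer] = b           ! one-sided put: cheap for the sender
      notify ([consumer])       ! tell the consumer the data is on its way
   else if (this_image() == consumer) then
      query ([sender])          ! wait for the sender's notification
      print *, a                ! a can now be used; no sync all is needed
   end if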
Remark 1: Because the notifying image could proceed to another context and would produce other NOTIFYs in this new context for other purposes, it will be necessary to differentiate between the different contexts.

Remark 2: QUERY([proc]) will wait and block image proc in the case that image target_proc has not yet received the data. This is very different from the behaviour of QUERY([proc], READY=ready), which blocks neither image target_proc nor image proc. I would recommend different names, e.g. BLOCKING_QUERY for the first case.

.........................................................................

3. Laksono Adhianto, Guohua Jin, John Mellor-Crummey, Karthik Murthy, Dung Nguyen, William N. Scherer III, Scott Warren, and Chaoran Yang
Department of Computer Science, Rice University
{laksono, jin, johnmc, Karthik.S.Murthy, dxnguyen, scherer, scott, chaoran}@rice.edu
23 September 2011

This document comprises the Rice University Coarray Fortran group's response to ISO/IEC JTC1/SC22/WG5 N1868, Invitation to comment on the contents of the TS on further coarray features, dated 8 July 2011. In that document, commenters are asked to respond explicitly to the question: Is the technical content of N1858 suitable for the TS on further coarray features?

We believe that, to a large extent, the answer to this question is yes. While we have some reservations about the proposed definition of the features, we feel that the coarray features enable one to build many useful parallel Fortran applications. In the remainder of this document, we provide specific criticism of the coarray features defined in N1858, based on our experience designing, developing, and using the Rice Coarray Fortran 2.0 prototype. To summarize our response to N1858, we espouse the following:

Proposed additions:
. TEAM_WORLD
. TEAM_SIZE / TEAM_RANK
. TEAM_DEFAULT and WITH TEAM
. TEAM_BARRIER
. TEAM_SPLIT

Proposed deletions:
. FORM_TEAM
. TEAM_IMAGES
. SYNC TEAM
. SYNC ALL
. NUM_IMAGES

1. Missing Features

We understand that the goal of N1858 is to provide a small coherent set of coarray features that enable developers to write reasonable Fortran applications with coarray features. We believe that a few small additions to the features presented in N1858 will greatly expand the range of applications that can be conveniently expressed.

1.1. TEAM_WORLD

We advocate pre-declaring a team variable representing the entire set of images in the application. This variable, which we have named TEAM_WORLD in the Rice Coarray Fortran 2.0 prototype, greatly simplifies the documentation of the language.

1.2. TEAM_DEFAULT and WITH TEAM

We advocate pre-declaring a team variable that represents the current default team for any team-based operations. At program start, TEAM_DEFAULT is initialized to TEAM_WORLD. In the Rice CAF 2.0 prototype, we let developers change the default team via a dynamically scoped, block-structured WITH TEAM statement. For instance:

   WITH TEAM (ROW_TEAM)
      ! the value of TEAM_DEFAULT for all team operations within this
      ! block, or anything it calls, is ROW_TEAM
      ...
   END WITH TEAM

1.3. TEAM_SIZE and TEAM_RANK

We have found the ability to query the number of members of a team (TEAM_SIZE) and to query the current image's logical position within a team (TEAM_RANK) to be indispensable for working with processor subsets. TEAM_SIZE(TEAM_WORLD) gives the total number of images. In the current proposal, images index coarray data and synchronize using absolute image numbers; additionally, image teams are described by a set of absolute image numbers. As Skjellum et al. [1] observe, "libraries don't want to describe point-to-point communication using hardware-dependent names, in fact, many algorithms are more natural if described in terms of point-to-point calls relative to a virtual topology naming scheme." They advocate abstract names for processors based on virtual topologies, or at least rank-in-group names (i.e., image numbers relative to a subset of the executing images). For instance, synchronizing with one's successor in a ring formed from a processor subset is a trivial operation given the TEAM_SIZE and TEAM_RANK operations and the ability to index via rank-in-group names.
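A sketch of that ring example (illustrative only: TEAM_RANK and TEAM_SIZE follow our CAF 2.0 prototype, ranks are assumed to run from 1 to TEAM_SIZE, and the interpretation of the image numbers in SYNC IMAGES as rank-in-team names is exactly the capability being advocated, not Fortran 2008 semantics):

   integer :: me, np, succ, pred

   np   = team_size(team_default)    ! number of images in the current team
   me   = team_rank(team_default)    ! my position within it (assumed 1..np)
   succ = mod(me, np) + 1            ! my successor in the ring
   pred = mod(me - 2 + np, np) + 1   ! my predecessor in the ring

   sync images ([pred, succ])        ! synchronize with the ring neighbours by
                                     ! rank within the team, not by absolute
                                     ! image number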
1.4. User-defined reductions

We have found that supporting user-defined reductions greatly enhances the expressiveness and utility of collectives in the Rice Coarray Fortran 2.0 prototype, and we recommend that they be supported in the standard. (It is unclear whether this is what Section A.2 in N1858 is intended to represent; in any event, Section A.2 does not correspond to the definition of CO_SUM in Section 4.3.10 of N1858.) In our prototype, we allow the use of Fortran 77 routines as explicit reduction operators.

1.5. Additional collective operations

We argue that the following collective operations are important enough to add to the list of those supported in the standard:
. TEAM_BARRIER
. TEAM_SCAN
. TEAM_BROADCAST
. TEAM_SCATTER, TEAM_GATHER, and TEAM_ALLGATHER
. TEAM_ALLREDUCE
. TEAM_ALLTOALL (personalized)
. TEAM_SHIFT
We advocate the object-oriented TEAM_ prefix for collectives to show that they are team-oriented and to ensure that these routines remain adjacent both in the Fortran standard and in references based on it. In general, it seems prudent to compare the list of supported collectives and their semantics with those provided by MPI, which has seen widespread adoption.

2. SYNC TEAM

We note that if a named TEAM_WORLD variable is defined as we advocate in Section 1.1, SYNC TEAM (when applied to TEAM_WORLD) supplants SYNC ALL. As we note above, we propose using TEAM_BARRIER() instead of SYNC TEAM, as this is the name that a programmer is likely to look for.

3. NUM_IMAGES

This intrinsic should be removed, as it is equivalent to calling TEAM_SIZE(TEAM_WORLD).

4. NOTIFY/QUERY statements

We are deeply concerned that the design presented for NOTIFY and QUERY does not provide a safe synchronization space. If all synchronization is tied directly to the image, there is no way for application code to determine that a NOTIFY was intended for it as opposed to a library routine, and vice versa. In our experience, the resulting confusion can be extremely difficult to debug due to its nondeterminism. For example, if one is attempting to overlap computation with synchronization latency, one cannot safely call a parallel library routine, because the library routine may itself use NOTIFY/QUERY synchronization internally, and there would be no way to distinguish NOTIFYs intended for the user code from those intended for the library code.

We believe instead that synchronization should be performed on first-class EVENT variables. It would also be convenient for the names NOTIFY and QUERY to be replaced with EVENT_NOTIFY and EVENT_QUERY, so that they are directly adjacent within the standard and within programmer references based on the standard.

We have found that it is occasionally useful to QUERY for more than one notification to an event at a time. For example, in a scenario in which each image has an "incoming" event variable, waiting for all four neighbors to signal during the 2D halo exchange of a stencil computation could be a single call to EVENT_QUERY. For this purpose, we recommend adding an optional COUNT argument to EVENT_QUERY (see the sketch at the end of this section). Symmetrically, one might consider adding it to EVENT_NOTIFY, though the case for doing so is less clear.

The language "for each different T in its image set" on line 4 of Section 2.3 is ambiguous (and hard to interpret). We recommend changing it to "for each image T (T != M)" for clarity.
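The halo-exchange sketch referred to above (illustrative only: the EVENT type, EVENT_NOTIFY, EVENT_QUERY and its COUNT argument are the features being proposed here, not existing Fortran syntax, and the grid variables and neighbor indices are placeholders):

   integer, parameter :: n = 100
   integer            :: north, south, east, west   ! neighbor image indices
   type(event)        :: incoming[*]                ! one event per image
   real               :: u(n,n), halo_w(n)[*]       ! field and west halo buffer

   ! push my eastern edge into my east neighbor's west halo buffer,
   ! then signal that neighbor's event; the other three directions
   ! are handled in the same way
   halo_w(:)[east] = u(n,:)
   call event_notify (incoming[east])

   ! wait until all four neighbors have signaled before computing
   call event_query (incoming, count=4)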
5. FORM_TEAM intrinsic procedure

It seems odd to us that team formation is not itself a collective operation, given that each image is going to be executing code as a member of the team anyway. Also, constructing an explicit list of images poses an inherent scalability problem on machines with huge numbers of processors, as will be common in the upcoming exascale era. In discussions with the MPI creators, we have learned that MPI_COMM_SPLIT is almost the only way that users create processor subsets [personal communication]. It is easy to use, and it admits scalable implementations of both the operation and the representation on each processor. In looking at how to adapt MPI for exascale systems, MPI developers have explored scalable implementations of MPI_COMM_SPLIT [2, 3], and we have found that our analogous implementation of TEAM_SPLIT for Coarray Fortran 2.0 [4] scales exceptionally well.

For both scalability and simplicity of use, we advocate TEAM_SPLIT as the mechanism for creating teams in Fortran. Initially, one would apply TEAM_SPLIT to TEAM_WORLD to create subteams. The resulting teams could be further subdivided as application requirements dictate. As with MPI_COMM_SPLIT, TEAM_SPLIT would have each member of the current (parent) team specify a color (which identifies the subteam that will include this image) and a rank (used to compute the relative order of this image within its subteam). If two or more images in a subteam specify the same rank, their order is determined by their rank in the parent team. See the documentation of MPI_COMM_SPLIT in the MPI standard for further details.
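For instance, splitting the images into row teams of a logical process grid might look as follows (a sketch only: TEAM_SPLIT, the team type, and WITH TEAM follow our CAF 2.0 prototype, and the argument order shown, which mirrors MPI_COMM_SPLIT, is illustrative):

   integer, parameter :: ncols = 4               ! illustrative grid width
   type(team)         :: row_team
   integer            :: me, color, rank_in_row

   me          = team_rank(team_world)           ! my position in the parent team
   color       = (me - 1) / ncols                ! equal color means same row team
   rank_in_row = mod(me - 1, ncols) + 1          ! my desired rank within that team

   call team_split (team_world, color, rank_in_row, row_team)

   with team (row_team)
      call team_barrier ()                       ! barrier across my row only
   end with team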
6. TEAM_IMAGES intrinsic procedure

This feature seems to be included solely to support the creation of subteams from an existing team. Judging from how infrequently applications use the MPI_GROUP feature compared with MPI_COMM_SPLIT, we believe that simply providing TEAM_SPLIT instead is the right approach for both simplicity and scalability. Having an individual processor hold a list of all of the images in a huge team is a potential scalability issue.

7. References

[1] A. Skjellum, N. E. Doss, and P. V. Bangalore. Writing libraries in MPI. In A. Skjellum and D. S. Reese, editors, Proceedings of the Scalable Parallel Libraries Conference, pages 166-173. IEEE Computer Society Press, October 1993.
[2] A. Moody, D. H. Ahn, and B. R. de Supinski. Exascale algorithms for generalized MPI_Comm_split. In EuroMPI 2011, 2011.
[3] P. Sack and W. Gropp. A scalable MPI_Comm_split algorithm for exascale computing. In R. Keller, E. Gabriel, M. Resch, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, volume 6305 of Lecture Notes in Computer Science, pages 1-10. Springer Berlin / Heidelberg, 2010. doi:10.1007/978-3-642-15646-5_1.
[4] J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and G. Jin. A new vision for Coarray Fortran. In PGAS '09: Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, pages 1-9, New York, NY, USA, 2009.

Appendix A. Other Important Features

In this section, we discuss features that we view as critically important for coarray-based Fortran applications. Although adding the collection of features listed below would violate the "zero sum" principle that the number of additions must balance the number of deletions, we believe that it is important to add them to the language.

A.1. Atomic Operations

As suggested by Bill Long in N1835, atomic operations should be added to the language to leverage hardware support for direct read-modify-update operations, both locally and across an interconnect. The list of operations to support includes:
. atomic_cas
. atomic_add
. atomic_fadd
. atomic_and
. atomic_fand
. atomic_or
. atomic_for
. atomic_xor
. atomic_fxor
where 'f' indicates a fetch of the old value.

A.2. Predicated COPY_ASYNC

The predicated COPY_ASYNC operation in our CAF 2.0 prototype allows one to explicitly overlap communication of arbitrary amounts of data with computation, and to specify precisely when the copy may begin, when the source data may be overwritten, and when the destination data may be read. Syntactically it is defined as:

   copy_async(var_dest, var_src [, ev_dr] [, ev_cr] [, ev_sr])

where:
   var_dest is a coarray reference that is the target of the copy
   var_src is a coarray reference that is the source of the copy
   ev_dr (aka destination ready) is an optional event indicating that the write to var_dest is complete and var_dest can be read safely
   ev_cr (aka copy ready) is an optional event indicating that the copy can start
   ev_sr (aka source ready) is an optional event indicating that the read of var_src is complete and var_src can be overwritten safely
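A sketch of a typical use, prefetching remote data while other work proceeds (illustrative only: COPY_ASYNC, the event type, and EVENT_QUERY are prototype features rather than standard Fortran, the keyword form ev_dr= is assumed for selecting the optional argument, and do_other_work is a placeholder):

   integer, parameter :: n = 100
   type(event)        :: dr                    ! destination-ready event
   real               :: remote(n)[*], buf(n)[*]
   integer            :: p

   p = 2                                       ! illustrative source image

   ! start copying remote(:) from image p into my local buf;
   ! dr will be notified once buf may safely be read
   call copy_async (buf, remote(:)[p], ev_dr=dr)

   call do_other_work ()                       ! overlap computation with the copy

   call event_query (dr)                       ! block until the prefetched data
                                               ! has arrived; buf can now be used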
A.3. Asynchronous Collectives

Asynchronous versions of collective operations enable one to overlap computation with the communication inherent in effecting the collective. Syntactically, asynchronous versions would have the same format as their synchronous counterparts, but they would include one additional parameter: an event to be notified upon completion of the collective. Calling the asynchronous collective would be a non-blocking operation that initiates the collective and returns; when the results of the collective are needed, a call to EVENT_QUERY would block until completion.

A.4. Copointers

CAF 2.0 adds global pointers to the Fortran language, which support references to remote coarray sections as well as distributed linked data structures. The definition and use of these new "copointers" is intentionally similar to that of ordinary Fortran pointers: they are declared with new attributes analogous to 'pointer' and 'target', manipulated with the existing '=>' pointer assignment statement, and inspected with the existing pointer intrinsics. Accessing data via copointers is similar to existing coarray access, with implicit access to the local image and explicit access to remote images using a square-bracket notation. CAF 2.0's copointers may point to values of any type, including coarrays. Setting up copointers once in the initialization section of a program can lead to dramatically simpler reading and updating of halo regions on neighboring processors. The following code fragment illustrates an example usage of copointers:

   integer, dimension(:), copointer :: p7, p8   ! copointers to array of integer
   integer, dimension(10), cotarget :: a3[*]    ! coarray of array of integer

   p7 => a3          ! copointer to a3's local coarray section
   p8 => a3[9]       ! copointer to a3 on image 9

   p7(6) = 1         ! assigns 6th element of the local section of a3
   p8(6)[] = 42      ! assigns 6th element of the target remote coarray

We note that accessing remote data via copointers remains explicit; this conforms to the spirit of the coarray Fortran extensions in maintaining visual cues that mark remote operations. Further details of copointers can be provided upon request.

.........................................................................

4. Robert Numrich

Robert telephoned me, saying that he would like the extension in the TS to be small. He supports the revised set of collectives proposed by Bill Long in N1835, but without the team argument. He asked for three of his comments in N1835 to be reiterated:

a. The intrinsic function this_image()

The function this_image should allow a scalar return value for coarray arguments with just one codimension:

   integer :: me
   real :: x[*]
   me = this_image(x)

Internally, the function may continue to think it is returning an array of length one, but the programmer should not be penalized for that. Let the value on the left side of the assignment statement be a scalar. At most, issue a warning at compile time. I hit this problem every time I write new code. It is embarrassing trying to explain it to a new coarray programmer.

f. Teams

Remove teams completely from the proposed extensions. [For his rationale, see N1835.]

g. Notify/Query

We should hasten slowly with these statements. The current definition is probably wrong. There probably needs to be some sort of tag associated with these statements, making them look more like events.

.........................................................................

5. Nick Maclaren

NO, with comments. I have not had time to think about this area, but I agree with Reinhold Bader and Robert Numrich, for reasons similar to theirs. I have rechecked the WG5 Garching minutes and, while we did not formally agree to Reinhold's point (2), I recall there being a consensus that it was a necessary step within J3. From my experience of MPI and OpenMP, Robert has pointed out the two most difficult parts of N1858 to specify semantically, unless we have both missed something subtle. We need to go cautiously, and watch out for the semantics. Specifying the syntax is the easy bit.

.........................................................................

6. Tobias Burnus

As a Fortran user and Fortran-compiler developer, I am in favour of a smaller TS. Given that few compilers support coarrays (as defined in Fortran 2008) well, practical experience with the newer features is still rather limited. Thus, adding the most important missing features as a TS allows user experience to be drawn on for additional features during the F201{3,8} development. For me, the collective/broadcast feature is the most important omission; some means of subdivision (teams) or collective I/O seems to be of lesser importance (for my projects at least).