ISO/IEC JTC1/SC22/WG5 N1930

Requirements for additional parallel features in Fortran

Bill Long, 28-Jun-2012

A Technical Specification, "Additional Parallel Features in Fortran",
is proposed.

1. Overall size

S1: The complexity of the TS should be comparable with that of
document N1858, from the point of view of both implementation and
edits to the standard. This is the essence of Resolution G9 of the
Garching meeting (see N1861).

This set of requirements specifies a TEAM facility different from the
one in N1858, an EVENT facility as an alternative to the NOTIFY/QUERY
facility in N1858, and a simpler set of collective subroutines. It
adds new intrinsic procedures for atomic memory operations, but omits
the parallel I/O facilities in N1858. On balance, the requirement S1
is satisfied.

2. Teams

Teams provide a capability to restrict the image set of remote memory
references, coarray allocations, and synchronizations to a subset of
all the images of the program. This simplifies writing programs that
involve segregated activities (parts of a climate model, for example)
that might more easily be written independently or may have already
been written as independent programs. Teams also provide a mechanism
for subdividing the computation for the sake of better performance
(such as within local SMP domains). Finally, teams provide the
capability to execute procedures (such as library procedures) that use
coarrays internally on a subset of the images of a program.

T1: When a block of code is executed on images executing as a team, it
should execute on those images as if the program contained no other
images. This has the following implications:

   1. Image indices shall be relative to the team, starting at 1 and
      ending with the number of images in the team.
   2.
      Collective activities that would involve all images, such as
      SYNC ALL, allocation and deallocation of coarrays, collective
      subroutine execution, and inquiry intrinsics such as THIS_IMAGE
      and NUM_IMAGES, shall be relative to the team.

T2: While an image executes a statement it shall be a member of one
and only one team. Access to variables on images outside the current
team is not permitted.

T3: It should be possible to split a team into mutually exclusive
subsets that are themselves teams. This should be dynamic in order to
allow different groupings of images during different stages of
execution. It is desirable to have a compact mechanism for an image to
specify the team to which it wishes to belong.

T4: There shall be a construct mechanism for changing the current
team, involving the synchronization of all members of the teams at the
beginning and end of the construct. The construct shall support
separate execution blocks based on team membership. The construct
shall make apparent (both to the system and the programmer) where team
execution begins and ends.

T5: There shall be a type for variables identifying a team collection
(probably an opaque derived type defined in the intrinsic module
ISO_FORTRAN_ENV).

T6: There needs to be a mechanism to find the image index relative to
the image set of an ancestor team. This might best be done by adding
an optional argument to IMAGE_INDEX that specifies the ancestor team.

T7: An allocatable coarray that is allocated within a team construct
shall be deallocated before execution of the team construct
terminates. A coarray that was allocated in a parent team shall not be
deallocated within a child team construct.

T8: The restriction that standard input is attached only to image 1 is
unchanged, and the designated image is image 1 of the original set of
images present at program startup.

3. Collectives

A collective subroutine is an intrinsic subroutine that is executed by
a set of images.
It performs a computation based on values on the images of the set.
Collective subroutines offer the possibility of substantially more
efficient execution of reduction operations than non-expert
programmers could achieve on their own. Corresponding routines are
widely used in MPI programs.

C1: A call to a collective subroutine is not an image control
statement. However, such a call shall appear only in a context that
allows an image control statement. Even though calls to collective
subroutines involve internal synchronization required by the usual
rules for reference and definition of subroutine arguments, they do
not facilitate ordering of segments.

C2: If a collective subroutine is invoked on one image, it shall be
invoked by the same statement on all images of the current team.

C3: A collective subroutine based on a user-written procedure that
applies the required operation to local variables shall be provided.
In addition, because they are often needed, there should be specific
collective subroutines for SUM, MAX, and MIN for intrinsic types for
which the corresponding operations are defined. Forms that provide the
result to just one image or to all the images involved should be
provided. Beyond this, there should be a collective subroutine that
broadcasts a value on one image to a set of images. Coindexed source
and result arguments are not permitted.

4. Additional intrinsic atomic subroutines

Atomic memory operations provide powerful low-level primitives for
synchronization of activities among images without use of
heavy-weight synchronization and lock statements. They can provide
substantial performance advantages.

A1: Atomic intrinsic subroutines shall be provided for
atomic-compare-and-swap, atomic-integer-add, atomic-bitwise-and,
atomic-bitwise-or, and atomic-bitwise-xor. For the integer add and
bitwise logical operations, both the direct and "fetch-and" versions
should be supplied.

5.
Synchronization using events

The NOTIFY and QUERY statements were proposed in N1858, but to match
the execution of a NOTIFY statement on one image with the execution of
a QUERY statement on another image, that feature relied on the number
of times the statements were executed on the images. This mechanism is
not robust in the presence of segment reordering; for example, an
image that would otherwise be idle might bring other work forward. The
preferred mechanism involves tagged events. The tagging aspect is
important for employing this capability in a library routine in a way
that is hidden from, and does not interfere with, the caller.

E1: There should be a mechanism to allow one-sided ordering of
execution segments. For example, suppose image I executes successive
segments I1 and I2 and image J executes successive segments J1 and J2;
there might be a need for I1 to precede J2 without the need for J1 to
precede I2.

E2: The mechanism should use a data item (tag), accessible on all the
images, to identify the event. There shall be a type for variables
used as these tags (probably an opaque derived type defined in the
intrinsic module ISO_FORTRAN_ENV).

E3: Mechanisms shall be provided to post and test an event, and wait
for an event to be posted. Repeated posts to the same event increment
a counter internal to the tag, and each wait decrements the counter.
The statements implementing event post and wait are image control
statements. The test operation may be implemented by an inquiry
function, and hence would not be an image control statement.

E4: If multiple event wait operations specify the same event variable,
it is unspecified which one of these operations completes when the
corresponding event post occurs.
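
The one-sided ordering described in E1 through E3 can be sketched as
follows. This is an illustration only and not part of these
requirements: it assumes the syntax later adopted in Fortran 2018
(EVENT_TYPE in ISO_FORTRAN_ENV, EVENT POST, and EVENT WAIT), and the
names ready and payload are hypothetical. Image 1 defines a value on
image 2 in segment I1 and then posts; image 2 waits before reading the
value in segment J2, so I1 precedes J2 with no requirement that J1
precede I2.

```fortran
program event_sketch
  ! Assumption: Fortran 2018 event syntax, which postdates this paper.
  use, intrinsic :: iso_fortran_env, only: event_type
  implicit none

  type(event_type) :: ready[*]   ! hypothetical tag; event variables are coarrays
  integer :: payload[*]          ! hypothetical data sent from image 1 to image 2

  if (num_images() < 2) error stop "event_sketch needs at least two images"

  if (this_image() == 1) then
    ! Segment I1: define the data on image 2, then post the event there.
    payload[2] = 42
    event post (ready[2])
    ! Segment I2 would follow here; image 1 never waits for image 2.
  else if (this_image() == 2) then
    ! Segment J1 would precede the wait.
    event wait (ready)
    ! Segment J2: the post on image 1 precedes this point, so payload is defined.
    print *, "image 2 received", payload
  end if
end program event_sketch
```

Because the tag is an ordinary variable, a library routine could
declare its own event variable so that its posts and waits stay
separate from any events used by the caller.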