ISO/IEC JTC1/SC22/WG5 N1835

Requirements for TR of further coarray features
John Reid
10 September 2010

Resolution LV5 of the WG5 meeting in Las Vegas, Feb. 2008 reads:

   "LV5. Content and processing of TR on Enhanced Coarray Facilities
    That WG5 declares that the content of the Technical Report on
    Enhanced Coarray Facilities in Fortran is as shown in document
    J3/08-131r1. Further, WG5 expects the TR to be published during the
    second quarter of 2011."

At the WG5 meeting in June 2010, the target date for publication was changed
to November 2012 (see resolution LV5 and document N1812), but there has been
no discussion since 2008 of the technical content. The aim of this paper is
to explore alternatives to the features of J3/08-131r1, which are:

1) Collective intrinsic subroutines:
      CO_ALL CO_ANY CO_COUNT CO_MAXLOC CO_MAXVAL CO_MINLOC CO_MINVAL
      CO_PRODUCT CO_SUM
2) Teams and features that require teams:
      Team formation and inquiry: the FORM_TEAM and TEAM_IMAGES intrinsics,
      and the IMAGE_TEAM type.
      SYNC TEAM statement
      TEAM specifiers in I/O statements
3) The NOTIFY and QUERY statements.
4) A file connected on more than one image, except for the files preconnected
   to the units specified by OUTPUT_UNIT and ERROR_UNIT.

A draft TR containing exactly these features is visible as J3/10-166, and
this paper will use J3/10-166 as its base document. I hope that the technical
content of the TR can be decided at the WG5 meeting in June 2011.

This paper is arranged as a set of proposals, each with a summary and (where
appropriate) its technical details. The titles are

Proposal 1. Replace the set of collectives
Proposal 2. Add atomic compare-and-swap and other atomic subroutines
Proposal 3. Proposals from Bob Numrich with comments
Proposal 4. Add coscalars
Proposal 5. Allow asynchronous execution of the collectives
Proposal 6. Reconsider the handling of coarrays in a team
Proposal 7. Add coarray pointers
Proposal 8. Allow asymmetric allocatable and pointer objects
Proposal 9.
Suggestions for changes to the NOTIFY and QUERY statements

---------------------------------------------------------------------------
Proposal 1. Replace the set of collectives

Suggested by: Bill Long.

Summary

Replace the collective intrinsic subroutines by the new set
   CO_BCAST CO_MAX CO_MIN CO_REDUCE CO_SUM
which are not image control statements.

The current collective subroutine CO_SUM lacks some features that would
improve its usability and its performance. A new version is desired with
these enhancements:

1) The RESULT argument should be optional. If it is not present, the result
of the computation is assigned to the SOURCE argument. Rationale: the current
specification requires declaring a second variable to be used for the RESULT,
which is often unnecessary.

2) SOURCE, and RESULT if present, should be allowed to be non-coarrays.
Rationale: this significantly expands the potential usability of the routine,
particularly in the context of integrating coarrays into existing codes.
Internally, the routine could have a coarray of a derived type with a
component that is a pointer to the supplied SOURCE or RESULT argument, and
perform the computation using that structure. As an optimization for the case
of a scalar argument, the routine could have an internal coarray into which
the source is copied at entry and from which the result is copied at
completion.

3) Add a new optional argument, RESULT_IMAGE. If this is present, the result
is assigned only on the identified image and is not broadcast to all the
images. On all other images, the result variable becomes undefined.
Rationale: this is a reasonably common usage, and eliminating the broadcast
improves performance.

The designs for CO_MAX and CO_MIN follow that of CO_SUM, and much of the same
infrastructure can be reused. CO_MAX and CO_MIN differ from CO_SUM in that
they allow arguments of type character and do not allow arguments of type
complex.
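The interplay of the optional RESULT and RESULT_IMAGE arguments can be
sketched in a single-process model. The following Python sketch is purely
illustrative (the proposed intrinsic does not yet exist anywhere); the
function name co_sum_model and the list-of-images framing are assumptions of
the sketch, not part of the proposal:

```python
def co_sum_model(images, result_image=None):
    """Toy model of the proposed CO_SUM: images[i] plays the role of
    SOURCE on image i+1.  Returns the per-image result variables;
    None stands in for a variable that becomes undefined."""
    total = [sum(vals) for vals in zip(*images)]   # elementwise sum across images
    if result_image is None:
        return [list(total) for _ in images]       # result broadcast to every image
    # With RESULT_IMAGE present, only that image receives the sum; the
    # result variable on every other image becomes undefined.
    return [list(total) if i + 1 == result_image else None
            for i in range(len(images))]

# Two images holding [1, 5, 3] and [4, 1, 6]:
print(co_sum_model([[1, 5, 3], [4, 1, 6]]))                  # [[5, 6, 9], [5, 6, 9]]
print(co_sum_model([[1, 5, 3], [4, 1, 6]], result_image=2))  # [None, [5, 6, 9]]
```

The second call shows the performance-motivated case: the reduction is
computed once and nothing is broadcast back.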
The new intrinsic subroutine
   CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])
would broadcast a value to all images of a team. The new intrinsic subroutine
   CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE])
would provide a general routine for operations not currently covered. This
subroutine could also handle arguments of derived type, as long as the
specified operation is defined for the type. The specification follows that
for CO_SUM, with the addition of an OPERATION argument.

Technical details

CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])

Description. Broadcast a value to all images of a team.

Class. Collective subroutine.

Arguments.
SOURCE shall be a coarray. It is an INTENT(INOUT) argument. SOURCE becomes
defined on all images of the team with the value of SOURCE on image
SOURCE_IMAGE.
SOURCE_IMAGE shall be of type integer. It is an INTENT(IN) argument. Its
value shall be the image number of one of the images in the team.
TEAM (optional) shall be a scalar of type IMAGE_TEAM. It is an INTENT(IN)
argument that specifies the team for which the broadcast is performed. If
TEAM is not present, the team consists of all images.

Example. If SOURCE is the array [1, 5, 3] on image one, after execution of
   CALL CO_BCAST(SOURCE, 1)
the value of SOURCE on all images is [1, 5, 3].

........................................................

CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE])

Description. General reduction of elements on a team of images.

Class. Collective subroutine.

Arguments.
SOURCE shall be of a type for which the operation specified by the OPERATION
argument is defined. It is an INTENT(INOUT) argument. It may be a scalar or
an array. If it is a scalar, the computation result is equal to a
processor-dependent and image-dependent approximation to the application of
the operation specified by the OPERATION argument to the values of SOURCE on
all images of the team.
If it is an array, the value of the computation result is equal to a
processor-dependent and image-dependent approximation to the application of
the operation specified by the OPERATION argument to all the corresponding
elements of SOURCE on the images of the team. If RESULT is not present, the
value of the computation result is assigned to SOURCE. If RESULT is present,
SOURCE is not modified.
OPERATION shall be an external procedure that defines the binary, commutative
operation to be performed. The specified procedure shall have two scalar
arguments of the same type and type parameters as SOURCE, and return a result
of the same type and type parameters as SOURCE. The result of executing the
procedure is the value of performing the intended operation with the two
arguments as operands.
RESULT (optional) shall be of the same type, type parameters, and shape as
SOURCE. It is an INTENT(OUT) argument. If RESULT is present, the value of the
computation result is assigned to RESULT.
TEAM (optional) shall be a scalar of type IMAGE_TEAM (4.4.2). It is an
INTENT(IN) argument that specifies the team for which CO_REDUCE is performed.
If TEAM is not present, the team consists of all images.
RESULT_IMAGE (optional) shall be of type integer. It is an INTENT(IN)
argument. Its value shall be the image number of one of the images in the
team. If RESULT_IMAGE is present and RESULT is present, the result of the
computation is assigned to RESULT on image RESULT_IMAGE and RESULT on all
other images becomes undefined. If RESULT_IMAGE is present and RESULT is not
present, the result of the computation is assigned to SOURCE on image
RESULT_IMAGE and SOURCE on all other images becomes undefined.

Example. If the number of images is two, SOURCE is the array [1, 5, 3] on one
image and [4, 1, 6] on the other image, and MyADD is a function that returns
the sum of its two integer arguments, the value of RESULT after executing the
statement
   CALL CO_REDUCE(SOURCE, MyADD, RESULT)
is [5, 6, 9].
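The element-by-element application of a user-supplied OPERATION can be
mimicked with Python's functools.reduce. This is only a sequential sketch of
the semantics just specified; co_reduce_model and my_add are illustrative
stand-ins for the intrinsic and for the MyADD function of the example:

```python
from functools import reduce

def co_reduce_model(images, operation):
    """Sequential model of CO_REDUCE for an array SOURCE: apply the binary,
    commutative OPERATION across the corresponding elements of SOURCE on
    every image.  A real implementation may associate the operands in any
    order, which is why the TR text only promises a processor-dependent
    approximation when OPERATION is not exactly associative."""
    return [reduce(operation, elems) for elems in zip(*images)]

def my_add(a, b):          # plays the role of MyADD in the example above
    return a + b

print(co_reduce_model([[1, 5, 3], [4, 1, 6]], my_add))  # [5, 6, 9]
print(co_reduce_model([[1, 5, 3], [4, 1, 6]], max))     # [4, 5, 6]
```

The second call illustrates why CO_REDUCE subsumes routines such as CO_MAX:
only the operation changes, not the communication pattern.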
.................................................

CO_SUM (SOURCE [, RESULT, TEAM, RESULT_IMAGE])

Description. Sum elements on a team of images.

Class. Collective subroutine.

Arguments.
SOURCE shall be of numeric type. It is an INTENT(INOUT) argument. It may be a
scalar or an array. If it is a scalar, the computation result is equal to a
processor-dependent and image-dependent approximation to the sum of the
values of SOURCE on all images of the team. If it is an array, the value of
the computation result is equal to a processor-dependent and image-dependent
approximation to the sum of all the corresponding elements of SOURCE on the
images of the team. If RESULT is not present, the value of the computation
result is assigned to SOURCE. If RESULT is present, SOURCE is not modified.
RESULT (optional) shall be of the same type, type parameters, and shape as
SOURCE. It is an INTENT(OUT) argument. If RESULT is present, the value of the
computation result is assigned to RESULT.
TEAM (optional) shall be a scalar of type IMAGE_TEAM (4.4.2). It is an
INTENT(IN) argument that specifies the team for which CO_SUM is performed. If
TEAM is not present, the team consists of all images.
RESULT_IMAGE (optional) shall be of type integer. It is an INTENT(IN)
argument. Its value shall be the image number of one of the images in the
team. If RESULT_IMAGE is present and RESULT is present, the result of the
computation is assigned to RESULT on image RESULT_IMAGE and RESULT on all
other images becomes undefined. If RESULT_IMAGE is present and RESULT is not
present, the result of the computation is assigned to SOURCE on image
RESULT_IMAGE and SOURCE on all other images becomes undefined.

Example. If the number of images is two and SOURCE is the array [1, 5, 3] on
one image and [4, 1, 6] on the other image, the value of RESULT after
executing the statement
   CALL CO_SUM(SOURCE, RESULT)
is [5, 6, 9].

-------------------------------------------------------------------------
Proposal 2.
Add atomic compare-and-swap and other atomic subroutines

Suggested by: Bill Long.

Summary

Several people external to WG5, and several members of the committee, have
proposed that in addition to ATOMIC_DEFINE and ATOMIC_REF it would be very
useful to add an atomic read-modify-write intrinsic. The most basic of these,
in both theoretical work and practical implementations, is the atomic
compare-and-swap (CAS) intrinsic. The basic operation of this intrinsic is
   atomic_cas (atom, old, compare, new)
which performs atomically:
   old = atom
   if (old == compare) atom = new

The following further atomic subroutines are suggested:
   atomic_add  atomic_fadd
   atomic_and  atomic_fand
   atomic_or   atomic_for
   atomic_xor  atomic_fxor
where the 'f' versions are the "fetch_and_" versions of the ones without the
'f'. These have existed (with different spellings) in the Cray coarray
implementation from the beginning because of specific customer demands. All
take integer arguments. Having standardized and portable names would be good.

Technical details

ATOMIC_CAS (ATOM, OLD, COMPARE, NEW)

Description. Conditionally swap values atomically.

Class. Atomic subroutine.

Arguments.
ATOM shall be scalar and of type integer with kind ATOMIC_INT_KIND or of type
logical with kind ATOMIC_LOGICAL_KIND, where ATOMIC_INT_KIND and
ATOMIC_LOGICAL_KIND are the named constants in the intrinsic module
ISO_FORTRAN_ENV. It is an INTENT(INOUT) argument. If the value of ATOM is
equal to the value of COMPARE, ATOM becomes defined with the value of
INT (NEW, ATOMIC_INT_KIND) if it is of type integer, and with the value of
NEW if it is of type logical.
OLD shall be scalar and of the same type as ATOM. It is an INTENT(OUT)
argument. It becomes defined with the value of INT (ATOMC, KIND (OLD)) if
ATOM is of type integer, and the value of ATOMC if ATOM is of type logical,
where ATOMC has the same type and kind as ATOM and has the value of ATOM used
for the compare operation.
COMPARE shall be scalar and of the same type and kind as ATOM. It is an
INTENT(IN) argument.
NEW shall be scalar and of the same type as ATOM. It is an INTENT(IN)
argument.

Example.
   CALL ATOMIC_CAS(I[3], OLD, Z, 1)
causes I on image 3 to become defined with the value 1 if its value is that
of Z, and OLD to become defined with the value of I on image 3 prior to the
comparison.

------------------------------------------------------------------------
Proposal 3. Proposals from Bob Numrich with comments

Suggested by: Bob Numrich

a. The intrinsic function this_image()

The function this_image should allow a scalar return value for coarray
arguments with just one codimension:
   integer :: me
   real :: x[*]
   me = this_image(x)
Internally the function may continue to think it is returning an array of
length one, but the programmer should not be penalized for that. Let the
value on the left side of the assignment statement be a scalar. At most,
issue a warning at compile time. I hit this problem every time I write new
code. It is embarrassing trying to explain it to a new coarray programmer.

b. Remove restrictions on derived types with coarray components

Remove most, if not all, of the restrictions on derived types with coarray
components. I can't remember all the restrictions, but I think there are lots
of them. For those restrictions that absolutely cannot be removed, provide a
clear explanation of why. In particular, remove the restriction that a child
type can add a coarray component only if its parent has a coarray component.
It messes up inheritance by, for example, forcing every abstract type to
contain a dummy coarray component just in case somebody wants to extend it by
adding a coarray component, which will often be the reason it is being
extended.

c. Alternative sync statement

There only needs to be one sync() statement with different arguments:
   integer :: list(:)
   sync() or sync   ! sync with all images; behaves just like sync all
   sync(list)       !
sync with images in list(:); behaves like sync images(list)
   sync(memory)     ! local memory sync; behaves like sync memory
Existing sync statements remain valid if the programmer wants to use them.
See proposal f below for team sync.

d. Functions with side effects

Programmers may write functions with side effects, such as internal syncs or
allocation of coarrays, but they have no guarantee that the functions will be
executed in the order in which they are listed in the program statement, or
executed at all. The ordering of segments assumed by the programmer is
therefore broken. Functions with this kind of side effect need a new
attribute that requires them to be executed, and executed in the order
written in the program statement. The attribute IMPURE would be a natural
choice had it not already been used to mean something else.

Functions should be allowed to return objects with coarray components.
Constructors require this capability. A workaround using an overloaded
assignment statement is very awkward and frankly embarrassing.

e. Collectives

Collectives should be part of a support library, not part of the language. If
we make them part of the language, we need to be very careful how we define
them. The UPC people have been arguing about them for years. Duplicating the
MPI collectives should not be the goal. If the coarray model is compatible
with MPI, why not just use the MPI collectives?

If we do include them, collectives should be functions (with side effects as
in proposal d):
   s = co_sum(x)
They should be simple, mimicking the normal functions:
   s = sum(x)
No long list of arguments, please. See proposal f for collectives within
teams. Every image must invoke the function, and every image gets the result
on return from the function. The argument x need not be a coarray. Since they
are collective, they imply a segment boundary upon entry. Each image can
return from the function independently as soon as it receives the value of
the result.
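The execution model just described — every image calls the function, a
segment boundary is implied on entry, and each image returns independently as
soon as the value is available — can be imitated with threads standing in for
images. This is a hypothetical sketch of the idea only, not TR text; all
names are invented:

```python
import threading

NUM_IMAGES = 4
entry = threading.Barrier(NUM_IMAGES)     # models the segment boundary on entry
state_lock = threading.Lock()
state = {"sum": 0, "contributions": 0}
ready = threading.Event()
results = [None] * NUM_IMAGES

def co_sum(x):
    """Function-style collective: invoked by every image, and every image
    gets the reduced value back as the function result."""
    entry.wait()                          # all images reach the collective
    with state_lock:
        state["sum"] += x
        state["contributions"] += 1
        if state["contributions"] == NUM_IMAGES:
            ready.set()                   # last contributor publishes the result
    ready.wait()                          # return as soon as the value is known
    return state["sum"]

def image(i):
    results[i] = co_sum(i + 1)            # image i contributes the value i+1

threads = [threading.Thread(target=image, args=(i,)) for i in range(NUM_IMAGES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                            # every image sees 1+2+3+4 = 10
```

Note that the images do not wait at a second barrier: each returns as soon as
the result is published, matching the "return independently" point above.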
I would like to see some evidence that nonblocking collectives really make a
difference in the overall performance of a real application, not just a
kernel.

f. Teams

Remove teams completely from the proposed extensions. The ability to couple
two coarray codes already exists using MPI intercommunicators or one of the
many frameworks out there. These frameworks just need to allow coarray codes
as components. Teams are a very big addition to the language, and we should
hasten slowly.

A coarray code should be the same whether it is run alone or run as a team
coupled with another coarray code. With the current definition of teams this
is not true: both codes will need to be altered to run as teams.
Dereferencing codimensions relative to a team is a very big problem. Leaving
it up to the programmer is very, very difficult and very error prone. All
coarray references will need to be changed so that they are relative to the
team. Symmetric memory will be broken if we allow allocation within teams. We
should not do teams.

If we go ahead with teams, somebody has to figure out how codimensions are
dereferenced relative to a team. Otherwise, teams are pretty useless. If a
coarray is declared in code only executed by a team, is the coarray visible
across all images or just within the team? If we allow allocation within a
team, the allocate should be a method associated with the team object:
   Type(Team_Object) :: myTeam
   real, allocatable :: x[:,:]
   stat = myTeam%allocate(x[p,*])
This allocate implies a sync within the team. A coarray allocated this way
could be given a state that includes information on how to dereference
codimensions. Do image indices then start with one, or start with the first
image in the team? How do we deal with asymmetric heaps? Are coarrays
allocated by one team visible to other teams? How?
Collective functions within a team should be associated with team objects:
   Type(Team_Object) :: myTeam
   s = myTeam%sum(x)
Synchronization within a team should be a procedure associated with the team
object:
   Type(Team_Object) :: myTeam
   myTeam%sync()       ! sync with images in myTeam
   myTeam%sync(list)   ! sync with a subset of images in myTeam
Other associated functions:
   myTeam%isMyTeam()
   myTeam%myTeamIndex()
   myTeam%teamList()
But I repeat, we should not add teams.

g. Notify/Query

We should hasten slowly with these statements. The current definition is
probably wrong. There probably needs to be some sort of tag associated with
these statements, making them look more like events.

h. Locks

Add some logical functions associated with lock variables:
   type(Lock_Object) :: lck[*]
   if (lck%isMyLock()) then
      unlock(lck)
   end if
Otherwise, all images must attempt to unlock the variable and deal with an
error code. In the same way, the function isLocked() determines whether a
lock is already locked:
   if (.not. lck%isLocked()) then
      lock(lck)
   end if
This function also allows spinning on a lock until it is free.

i. MPI, OpenMP, UPC, CUDA compatibility

Is the coarray model compatible with other programming models?

-------------------------------------------------------------------------
Proposal 4. Add coscalars

Suggested by: Reinhold Bader

Summary

Add "coscalars". A coscalar exists on a single image and is referenced by
appending []. It may be a scalar or an array. It may be a pointer,
allocatable, or neither. When a pointer or allocatable coscalar is allocated,
the programmer can choose the host image; otherwise, the host image is
processor dependent and does not change during the lifetime of the coscalar.
The main application is to program-wide linked lists that are modified rarely
but accessed frequently. There are also advantages in having coscalar locks.

Technical details

1.
Introduction:
~~~~~~~~~~~~~~~~

The intent of this paper is to bring forward arguments in favour of including
unsymmetric shared entities (denoted "coscalars") in the Technical Report on
Enhanced Parallel Computing Facilities. An informal description of the
desired features is provided, which attempts to obey the following
constraints:

* Coscalar functionality is kept as orthogonal as possible to coarrays.
  Having few interactions between the two features should minimize the
  implementation effort.
* Coscalar syntax and semantics follow the design principles for coarrays as
  far as possible, with respect to visual indication of communication and
  with respect to synchronization semantics.

A key feature is the possibility of allowing subobjects of a derived type
entity to be hosted on an image different from that hosting the parent
object. This is achieved by introducing coscalar pointers to coscalars; see
sections 5 and 6.1 for details. The suggested language elements are analogous
to shared scalar entities and shared pointers to shared entities in UPC, but
with additional provisions and restrictions to increase safety of use, as
well as suggestions for performance tuning.

2. Complex data structures and their scalability limitations:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main scenario targeted by the feature described in this paper is the case
of general, program-wide data structures like lists, deques, binary trees,
oct-trees, etc., which are modified once or only rarely but traversed and
referenced often throughout the execution of the program. For example, MPI
codes which need to perform dynamic load balancing during program execution
typically implement this kind of concept manually (and with great programming
effort).
While it is possible to implement such concepts using the coarray facilities
defined in the base language (for example, by using allocatable components of
a coarray of suitable derived type), this is still significantly more complex
to program and maintain than, e.g., an OpenMP tasking code. Furthermore, it
may require repeated reallocation of coarrays if the size of the data
structure is not known a priori, thereby incurring repeated program-wide
synchronization and hence scalability issues.

For the scenario indicated above, the arguments against the use of shared
pointers become less relevant since

* the workload should typically be considerably larger than the latency for
  accessing a shared pointer, and
* the double latency incurred for dereferencing a shared pointer as well as
  accessing its target may be reduced by caching the descriptor information,
  during synchronization phases, on all images requiring it.

3. Coscalar declaration, definition and reference:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar entity is declared in one of the following ways:

   real, codimension[] :: cs1       ! real scalar coscalar
   real :: cs2[]                    ! alternative declaration
   real, target :: ca1(ndim)[]      ! real array coscalar

Syntactically, the difference between a coscalar and a coarray is that a
coscalar does not specify a coshape; the corresponding semantics is that the
declared entity is shared between all images. Notwithstanding, there is one
image which is considered the "hosting image" of the coscalar. The hosting
image can be identified via the image_index() intrinsic, thereby allowing the
programmer to tune code for efficiency of access. For a statically declared
coscalar the hosting image is processor dependent; it is the same image
throughout the coscalar's existence.
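To make the hosting-image concept concrete, here is a toy single-process
model. The class and its method names are inventions for illustration only;
they mirror the image_index() inquiry and shared access loosely, and carry
none of the proposal's synchronization semantics:

```python
class Coscalar:
    """One shared value with a single hosting image.  All images see the
    same entity; the host matters only for access latency and for who may
    allocate or deallocate it."""
    def __init__(self, hosting_image, value=None):
        self._host = hosting_image     # processor-chosen for static coscalars
        self._value = value
    def image_index(self):             # models the image_index() inquiry
        return self._host
    def get(self):                     # models a reference to the shared value
        return self._value
    def set(self, value):              # models a definition of the shared value
        self._value = value

cs1 = Coscalar(hosting_image=1)        # host chosen by the "processor"
cs1.set(42.0)                          # any image may define it (with synchronization)
print(cs1.get(), cs1.image_index())    # 42.0 1
```

The point of the model is that there is exactly one value, not one per image
as with a coarray, and that its host is a queryable but fixed property.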
Every definition or reference of a coscalar requires the square brackets to
be specified:

   x = cs1[]
   cs2[] = a[4]

This maintains the visual indication of the occurrence of communication
already known from coarrays; for the purpose of notation it is assumed that
it is the exception rather than the rule that the hosting image references or
defines a coscalar. On the other hand, an implementation may generate
multiple code versions depending on whether the access is or is not local,
thereby assuring improved speed for local accesses.

The sequence of definitions and references of coscalars follows the same
synchronization rules as for coarrays. For example, in

   if (this_image() == 1) then
      cs1[] = ...
      sync images (*)
   else
      sync images (1)
      x = cs1[]
   end if

the SYNC IMAGES statements are required to prevent a data race between the
single image which defines cs1 and all the others which reference it.

4. Allocatable coscalars:
~~~~~~~~~~~~~~~~~~~~~~~~~

To enable control of memory locality by the programmer, a coscalar with the
allocatable attribute can be allocated on a specific image:

   real, allocatable, codimension[] :: cs3
   :
   allocate(cs3, image=4)

where the IMAGE argument to the ALLOCATE statement is obligatory; on images
other than the one specified, the statement has no effect (and it is of
course a violation of the synchronization rules if two images attempt to
allocate an unallocated entity in unordered segments). A call to
this_image(cs3) or allocated(cs3) on any image, in a segment executed after
the one in which the allocation is performed, will return the values 4 and
.TRUE., respectively.

It is required that the hosting image perform the deallocation:

   deallocate(cs3)

Executing this statement on images other than that hosting the coscalar has
no effect. If applied to coscalars (and local entities) only, neither the
ALLOCATE nor the DEALLOCATE statement performs any synchronization.
This improves scalability, especially if only small subsets of images (or
only teams) need to access the coscalar. The ALLOCATED intrinsic may also be
used on images which do not host the coscalar; it is atomic in the sense that
it may be executed in a segment unordered with respect to the one performing
the allocation or deallocation. However, it is the programmer's
responsibility to deal properly with race conditions which may result from
such a use, especially in the case of deallocation.

5. Pointers to coscalars:
~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar pointer to a shared entity is declared by specifying the pointer
attribute for a coscalar:

   real, pointer :: cp(:)[]
   type(team_array), pointer, codimension[] :: tp(:)

Such an entity is itself a coscalar (with a processor-dependent hosting image
unless it is a type component), and it may be pointer associated with a
shared entity with the target attribute:

   if (this_image() == R) cp[] => ca1(:)

Since image R may be distinct from both the image hosting the coscalar
pointer and the image hosting its target, the above pointer assignment
statement may involve up to three images; note that a similar situation can
also occur when using regular assignment with differently coindexed objects
on both sides of the assignment. Image control statements to perform
synchronization prior to subsequent references or definitions are required
only against the image R executing the pointer assignment. Also, it would be
allowed to define the target in segments unordered with respect to the one
executing the above pointer assignment, since only transfer of a descriptor
is required (for this reason the square brackets are omitted from the right
hand side); in this case synchronization would need to include the image
performing such a definition. Subsequent references to cp[] then go to the
target:

   x(3) = cp(3)[]   ! same as x(3) = ca1(3)[]

(One could also consider allowing coindexed objects as targets.)
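The descriptor-only nature of pointer association can be modeled as follows.
This is a sketch under the assumption that a descriptor is just a handle to
the target; the class and method names are invented for illustration:

```python
class CoscalarPointer:
    """Model of a coscalar pointer: association stores only a descriptor
    (here simply a Python reference), and element references are forwarded
    to the target, wherever it is hosted — the two-hop access whose latency
    the caching discussed in section 7 would hide."""
    def __init__(self):
        self._descriptor = None        # disassociated, like NULL()
    def associate(self, target):       # models  cp[] => ca1(:)
        self._descriptor = target      # only the descriptor is transferred
    def associated(self):              # models the ASSOCIATED() inquiry
        return self._descriptor is not None
    def get(self, i):                  # models  x(3) = cp(3)[]
        return self._descriptor[i - 1] # 1-based indexing, as in Fortran

ca1 = [10.0, 20.0, 30.0]               # plays the role of the target ca1
cp = CoscalarPointer()
cp.associate(ca1)                      # cheap: no element data moves
print(cp.associated(), cp.get(3))      # True 30.0
```

Note that associate() never touches the target's elements, which is why the
proposal permits it in segments unordered with target definitions.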
It is possible to dynamically allocate the target on a single image:

   allocate(tp(num_images()), image=1)

As with allocatable entities, the later deallocation must be performed on the
hosting image. The NULL() and NULLIFY() intrinsics are also available for
coscalar pointers; these may also be used on images which do not host the
pointer or its target, but if they do so they must be called in a segment
ordered with respect to any segment changing the association or definition
status of the pointer. The ASSOCIATED() intrinsic, like ALLOCATED(), is
atomic; in its two-argument form, both arguments must be coscalars if one is.
The target's hosting image may be determined by using the IMAGE_INDEX()
intrinsic on the coscalar pointer associated with it. Finally, just as for
regular pointers, it is possible to specify the CONTIGUOUS attribute for a
coscalar pointer, in which case its target must be simply contiguous.

6. Derived types:
~~~~~~~~~~~~~~~~~

The desired properties of shared general data structures rest on the
possibility of defining coscalar subobjects which may be hosted on an image
different from that hosting the parent data object. A number of restrictions
are required to ensure that no remote allocations or deallocations are needed
wherever dynamic type components are involved. Combinations of coscalars and
coarrays are disallowed in the derived-type context, i.e., a coarray may not
have coscalar type components, and a coscalar may not have coarray type
components.

6.1 Distributed structures
~~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar may appear as a type component provided it has the POINTER
attribute. This allows for a directory-like programming style:

   type :: team_array
      real, pointer, contiguous :: x(:)[]
   end type
   type(team_array), allocatable :: o(:)[]

   allocate(o(num_images()), image=myteam_first_image)
   sync team (myteam)
   if (member_of(myteam)) allocate(o(this_image())%x(localsize), &
                                   image=this_image())
   sync team (myteam)   !
synchronize across team only

After the SYNC TEAM, an image in the team may now define

   o(any_other_team_index)[]%x(:)[] = ...

where the subobject is hosted on image any_other_team_index (which may be an
image other than that allocating the entity o). The image hosting the
coscalar pointer component (not its target!) is the image hosting the parent
object. For simplicity of implementation, it is suggested that the parent
object of such a type be required to be a coscalar.

6.2 Local (dynamic) type components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Given a type definition with (regular) pointer or allocatable components, it
is possible to declare coscalars of such a type; any subobject of such an
entity is also a coscalar, and its allocation or association status may only
be changed on the image hosting its parent object:

   type :: ptr_type
      real, pointer :: x(:)
   end type
   type(ptr_type) :: o[]
   real, target :: y(5)

   if (image_index(o) == this_image()) then
      y = ...
      o[]%x => y
   end if
   sync all
   y = o[]%x   ! scatter

For pointer components this ensures that pointer association with a local
object is well defined.

   type :: alloc_type
      real, allocatable :: x(:)
   end type
   type(alloc_type) :: a[]

   allocate(a%x(5), image=image_index(a))   ! a%x is a coscalar
   sync all
   if (this_image() == 1) then
      a[]%x(:) = ...
   end if

For allocatable components this ensures that no remote deallocation is
required when the object goes out of scope.

6.3 Polymorphic entities/subobjects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The case of either polymorphic coscalars or polymorphic components of
coscalars will presumably require restrictions similar to those in the
coarray case. The details of these will need to be worked out.

7. Tuning considerations
~~~~~~~~~~~~~~~~~~~~~~~~

The envisaged main usage scenario for coscalars is the situation where an
object is generated once and then referenced multiple times on some or all
images (one or few writes, many reads).
This section contains some thoughts on how tuning of code using coscalars
could be performed.

7.1 Caching
~~~~~~~~~~~

While a coscalar always has a hosting image, an implementation may choose to
cache the entity, or parts of it, on (some) other images. If so, some garbage
collection scheme will be required to dispose of the cached copies once the
coscalar is deallocated or goes out of scope. The manner in which caching is
performed will be processor dependent, but the expectation is that a
high-quality implementation will perform caching

* on coscalar pointers, to reduce access latency, and
* on sufficiently small items.

One could consider providing an additional collective intrinsic to enforce
caching. Otherwise, implementation-dependent caching would be controlled by
execution of an image control statement. Finally, especially for the case in
which locality control is exerted by the programmer (see below), or if it is
known that an entity requires a large amount of memory (perhaps only
available to a subset of images), an attribute can be specified to suppress
caching:

   type(alloc_type), uncached :: a[]

7.2 Locality control
~~~~~~~~~~~~~~~~~~~~

As a convenience for the implementation of load-balancing algorithms, a
statement for changing the hosting image of an allocatable coscalar or a
coscalar pointer target is provided:

   relocate (a, image=4 [, team=...] [, sync='YES|NO'])

would change the hosting image of a to 4. This statement must be executed
collectively by all images (of a team). If a team argument is present, the
specified image as well as the image hosting the entity to be relocated must
be members of the team. The statement implies synchronization of all images
executing it unless the SYNC argument is specified with the value 'NO'; in
that case, it is the programmer's responsibility to insert synchronization
statements before subsequent references or definitions of the entity.
Relocation applies only to the parent object in case the object has
coscalar pointer components; the latter's targets remain on their
hosting images. Regular pointer components of a relocated coscalar
become undefined, and execution of a relocate statement may in this
case induce a memory leak. An allocatable component of a relocated
coscalar is reallocated on the new hosting image, and its contents are
transferred to the relocated entity.

7.3 Note on symmetric heaps
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many cases, the intent for distributed structures is to achieve a
balanced filling of memory across images. Hence, an implementation
might be able to use a symmetric heap even in this case, allocating it
in moderately large blocks, with an additional level of indirection
for accessing the data items in the structure. For such
implementations, a compiler directive allowing the programmer to
indicate a suitable block size might be useful.

8. Coscalars and subprograms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8.1 subprogram-local coscalars
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to coarrays, coscalars declared inside subprograms are
required to have the SAVE attribute, and automatic (array) coscalars
are not allowed, since the need to have a well-defined hosting image
would imply a need for synchronization. However, coscalars may be
declared locally in a subprogram without SAVE if they are allocatable
or have the POINTER attribute; it is then the programmer's
responsibility to ensure valid accesses by performing allocation and
deallocation and by inserting image control statements. An allocatable
coscalar is automatically deallocated once the image hosting it
completes execution of the subprogram.

8.2 coscalar dummy arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A subprogram dummy argument may be a coscalar, in which case the
actual argument must be a coscalar (the latter is then provided
without the square brackets).
Similar to the coarray case, restrictions are in place that ensure no
copy-in/out occurs. The dummy argument's hosting image is the same as
that of the actual argument. If a coscalar dummy argument has the
POINTER or ALLOCATABLE attribute, the actual argument must be a
coscalar with the same attribute.

8.3 Coscalar actual arguments matching a non-coscalar dummy argument
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the dummy argument is not a coscalar, the actual argument may
nevertheless be a coscalar, in which case copy-in/out will typically
be required. In this case the same additional synchronization rule
applies for modifiable arguments as for the corresponding case of a
coindexed actual argument. The actual argument must specify the square
brackets.

8.4 Generic interfaces
~~~~~~~~~~~~~~~~~~~~~~

Similar to coarrays, no generic disambiguation is possible with
respect to coscalar arguments.

9. Application to locks, teams and atomic procedures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the present standard, lock variables are required to be coarrays,
which may lead to misinterpretations (typically, only a subset of the
num_images() lock variables of a scalar coarray are actually
required). Using coscalars, the use of locks as well as the team
abstraction could be handled more elegantly:

   type(lock_type) :: my_lock[]
   type(image_team) :: my_team[]

Furthermore, this also facilitates using locks as components of data
structures:

   type :: container
      type(lock_type) :: lk
      type(data) :: protected_stuff
   end type

in which case any entity x of type container must be a coscalar, so
that x[]%lk is also a coscalar. Similarly, atomic subroutines should
be extended to allow scalar coscalars as arguments.
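As a sketch of the last point, an atomic subroutine could accept a
coscalar where the present standard requires a scalar coarray (this is
proposed syntax, not standard Fortran; the extension of ATOMIC_DEFINE
to coscalar arguments is hypothetical):

   integer(atomic_int_kind) :: flag[]   ! a coscalar, hosted on one image
   ...
   call atomic_define(flag[], 1)        ! any image may define it atomically

Only one flag exists program-wide, rather than the num_images() flags
implied by a scalar coarray, which matches the typical usage.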
Requiring teams to be coscalars instead of regular local entities
should provide advantages to both implementors and programmers:
 * for good scalability, teams can internally make use of coscalar
   pointer components, especially in the case of large image counts
 * handling teams is much more transparent and intuitive if they are
   coscalars; the usage pattern (write once, use often) fits
   perfectly, and if cross-team communication is to be supported, say

      with team(t1)
         a[i] = b[j]@t2
      end with team

   where t1 and t2 are teams not sharing any image, the shared
   semantics allows one to access team information across team
   boundaries, something not provided by the present draft.

10. An example: binary tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the following type definition

   type :: tree
      type(lock_type) :: lk
      type(content) :: entry
      ! entities of type content have < and possibly assignment overloaded
      logical :: defined = .false.
      type(tree), pointer :: left[] => null()
      type(tree), pointer :: right[] => null()
   end type

concurrent population of such an entity might be performed using the
following subprogram, which must be called with the same "this"
argument on each image:

   recursive subroutine insert(this, stuff)
      type(tree), intent(inout) :: this[]   ! must be a coscalar
                                            ! since we hand coscalars in
                                            ! and so that this[]%lk is
      type(content), intent(in) :: stuff
      lock(this%lk)
      if (this[]%defined) then
         unlock(this%lk)
         if (this[]%entry > stuff) then
            call insert(this%left, stuff)
         else
            call insert(this%right, stuff)
         end if
      else
         ! stuff goes to an entry possibly hosted by another image ...
         this[]%entry = stuff
         this[]%defined = .true.
         ! ... but I get to host the siblings ...
         allocate(this%left, image=this_image())
         allocate(this%right, image=this_image())
         ! caching of this[]%left and this[]%right to the new target is
         ! probably a good idea
         unlock(this%lk)
      end if
   end subroutine insert

After populating the data structure, the workload can be processed via

   recursive subroutine traverse(this, p)
      type(tree), intent(inout) :: this[]
      type(params), intent(in) :: p
      ! uses a subroutine operation() with non-coscalar dummy
      ! arguments to modify entries
      if (this[]%defined) then
         if (image_index(this) == this_image()) &
            call operation(this%entry, p)
         call traverse(this%left, p)
         call traverse(this%right, p)
      end if
   end subroutine

Note that if the calls to traverse() occur in segments ordered with
respect to the ones calling insert(), no race conditions occur. Since
each image only performs computation on the part of the tree hosted by
it, traverse() should scale well if operation() is sufficiently
expensive compared to the coscalar pointer lookup. For complete
processing, traverse() must be called by all images that previously
called insert(); it is not required that insert() be executed by all
images.

Acknowledgement:
~~~~~~~~~~~~~~~~

Apart from the conceptual derivation from UPC, the basic ideas
presented here are a subset of those in John Mellor-Crummey's papers
on his "CAF 2.0" vision; some modifications were made to improve the
integration with the language, as well as to enable the programmer to
perform optimization through locality control.

Comment from Jim Xia:
I don't like this name. It is very confusing, as people might think it
refers to a coarray that is scalar. So how about a new attribute,
SINGLE or SHARED? I know SHARED is going to be confusing as well to
people who are familiar with UPC.

Reply from Reinhold:
In choosing this, I started from the assumption that co- always refers
to something "shared" or "sharable". Since coarrays have a corank, it
seemed quite natural to call a corank-zero entity a coscalar.

-------------------------------------------------------------------------
Proposal 5.
Allow asynchronous execution of the collectives

Suggested by: Reinhold Bader

Allow asynchronous execution of the collectives. This would be
redundant if Proposal 1 is adopted.

-------------------------------------------------------------------------
Proposal 6. Reconsider the handling of coarrays in a team

Suggested by: Reinhold Bader

Reconsider the handling of coarrays in a team. Desirable features are
allocation within a team and a construct that establishes an execution
context for a team.

Issue (A): multiple objects with the same name on overlapping teams
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A problem solved neither by John Mellor-Crummey's team concept nor by
Bob Numrich's team%allocate function is that uniqueness of objects is
violated. That is, for an entity

   real, allocatable :: a[:]

and the situation where teams b and c overlap, the code

   with team b
      allocate(a[*])
   end with team
   with team c
      allocate(a[*])
   end with team

creates two distinct objects which exist simultaneously on the
overlapping set of images, notwithstanding that they have the same
identifier. While this could be tolerated, given the above syntax for
team execution or a notation for specifying team arguments on
coarrays, I think it is potentially rather confusing for the
programmer; it also seems to some extent to break the concept of a
local identifier, and as a consequence may be difficult to integrate
into the language.

Issue (B): reindexing in team execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From the point of view of the composability of existing coarray
software, it would be very desirable if subprograms executed by a team
(such as via a WITH TEAM block suggested by Mellor-Crummey) performed
renumbering of coindices as well as of image-set arguments to image
control statements and coarray-related intrinsics.
However, coarrays of corank 2 or higher, or coarrays with a lower
cobound unequal to 1, introduce a reindexing of their own, and it
appears that this cannot be brought into harmony with the team-related
reindexing. Also, even if it were possible to enable this
functionality (e.g. by introducing suitable restrictions on the kinds
of coarrays available for teamed execution), this would preclude
access to coarray data on an image which is not a member of the
presently executing team.

Issue (C): remove teams altogether
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even if it is decided that collective functions are to be included in
the TR, it is not desirable to remove teams entirely, since reductions
on subsets of images are a very useful feature, and the team concept
would support an optimized implementation of this. Similarly, because
the most efficient parallel I/O profile of large-scale parallel
programs makes use of image subsets typically an order of magnitude
smaller than the complete set of images, the availability of teams in
the context of I/O is considered important. For this reason, even if
it is decided that extending the handling of coarrays to teams is not
possible within the scope of the TR, teams should still be included to
cover the above functionality, perhaps with minor changes so as not to
stand in the way of adding coarray-related extensions later on.

An attempt to (partially) resolve issues (A) and (B)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The concept of an execution context is introduced, which serves to
localize - as far as possible - coarray functionality to the executing
team.
The intent is to bind all coarray entities to their execution
contexts; the need to be able to declare context-local entities
implies that there will be two ways available to the programmer for
creating non-default execution contexts:
 * a team-parameterized BLOCK construct
 * team-parameterized subprograms
For an execution context, the requirement is that all images of the
specified team must execute it; any image which is not a member of the
team and attempts to execute the context anyway simply transfers
control to the first statement after the end of the context.

As an example, consider the following code:

   program coupled
      type(image_team) :: ocean_team[*], atmosphere_team[*]
      : ! define integer value split
      if (this_image() < split) then
         call form_team(ocean_team[1])
      else
         call form_team(atmosphere_team[1])
      end if
      block with team ocean_team[1]
         ! only images from ocean_team execute this
         type(ocean_data), allocatable :: my_ocean(:, :)[:]
         allocate(my_ocean(n1, n2)[*])
         : ! work on ocean
      end block
      block with team atmosphere_team[1]
         ! only images from atmosphere_team execute this
         type(atm_data), allocatable :: my_atm(:, :)[:]
         allocate(my_atm(n3, n4)[*])
         : ! work on atmosphere
         ... = num_images()   ! size of atmosphere_team
         sync all             ! synchronizes across atmosphere_team only
         :
      end block
   end program

Inside the BLOCK construct executed by atmosphere_team, the coarray
my_atm exists on images split+1, ..., num_images() of the enclosing
execution context. However, execution of coarray intrinsics or image
control statements inside the block (in the form defined by F2008)
always refers to the local execution context. Access to the data in
the coarray my_ocean is not possible from the execution context
defined by the atmosphere_team block. At the end of the BLOCK
construct, the usual rules for deallocation of allocatable entities
etc. apply.
For a team-parameterized subprogram, the team will need to be part of
the subprogram's interface, since presumably the compiler will need to
propagate information contained in the team to any coarray-related
functionality inside the subprogram body. Subprogram calls or block
execution without a team parameter are performed using the caller's
execution context.

The next important question is: how are coarrays defined in the
enclosing execution context treated inside the local one? The (not
entirely satisfactory) answer given here is:
 * local definitions and references of such entities are trivially
   resolved (such entities are defined on a superset of the team image
   set, so there can be no problem);
 * coindexed definitions and references use the addressing of the
   execution context to which the entity is bound. In particular, this
   allows a team to read data which have been generated by another
   concurrently running team.
As a consequence it will sometimes be necessary to
 * identify an entity's execution context. This could, e.g., be done
   via an intrinsic function which returns a pointer to the team in
   which an entity has been created:

      type(image_team), pointer :: p[], q[]
      p[] => team_of(x)

 * be able to synchronize images from different teams against each
   other. This could be done by adding optional TEAM arguments to the
   image control statements. On team p,

      if (this_image() == 1) sync images (*, TEAM=q)

   would, for example, synchronize local image 1 of team p with all
   images of team q, without pairwise synchronization of the images in
   team q, and

      sync all ( TEAM=p, TEAM=q )

   would perform pairwise synchronization of all images in teams p
   and q.

One reason the answer is not entirely satisfactory is that it is the
programmer's responsibility to resolve the mismatch between the
coindexing of entities defined in the local and enclosing contexts,
respectively.
Supporting intrinsic functions should of course be provided, for
example

   global_image_index = TEAM_IMAGES(team_of(x), local_image_index)

and existing intrinsics may need to be extended with an additional
TEAM argument. (Note that there is a wart with IMAGE_INDEX(): the
version with a data argument will probably need to be split off into a
separate intrinsic.) The introduction of additional restrictions will
be required, for example that coarray entities from the enclosing
execution context with the ALLOCATABLE attribute may not be allocated
or deallocated inside the newly created context, etc.

A more radical suggestion, which does provide a satisfactory solution
from the ease-of-use point of view and reduces the implementation
effort at a minor (?) loss in functionality, is to prohibit coindexed
accesses to entities from enclosing execution contexts entirely. Local
accesses would still be allowed, enabling updates to the local portion
of such a coarray for later processing by other images. If still
desired, the additional functionality could be introduced post-TR.

-------------------------------------------------------------------------
Proposal 7. Add coarray pointers

Suggested by: Jim Xia

Add coarray pointers, requiring that the target of the pointer be a
local coarray. The primary motivation for this item is to allow
coarrays to be used in a function result. One example is to allow a
derived type with allocatable coarray components to be used as a
target to be associated with a pointer. It seems allowing the POINTER
attribute on coarrays is a reasonable solution.

I consider this proposal to comprise two separate parts. The first
part is to allow a derived type with allocatable coarray components to
be used as a target to be associated with a pointer.
The following is the original example from when I began to think about
allowing pointer coarrays. From a user's point of view, I would like
to allow the following practice:

   TYPE global_field
      REAL, allocatable :: f(:)[:]
   END TYPE

   TYPE my_field_type
      type(global_field), pointer :: global => null()
      REAL, allocatable :: local(:)
      ... ! type-bound operations
   END TYPE

where my_field_type stores a local copy of the global field that can
be updated frequently (e.g. with intermediate computational results).
The global field (as a coarray) is only updated whenever there is a
need. The type-bound operations can be functions returning objects of
this type as long as there is no update of the global field (i.e. no
violation of the segment ordering rules). Note this can also be used
as a strategy to re-mesh the global field when required; the remeshing
is encapsulated by my_field_type to hide the information from users
(e.g. when to update the global field). This declaration, however, is
currently not allowed.

The second part is the coarray pointers themselves. I would like to
suggest the following syntax:

   REAL, POINTER :: X(:)[:]

X can be allocated, or be associated with another coarray target.
Allocating X is the same as allocating an allocatable coarray:

   ALLOCATE (X(M)[*])

ALLOCATE and DEALLOCATE of X are considered collective operations, and
the same synchronizations as for allocatable coarrays apply here. X
can also be associated with a coarray target, as in

   X => Y

where Y is required to be a coarray target. In concept, each image has
its own X associated with a target of its own Y, so there should not
be any problems.

-------------------------------------------------------------------------
Proposal 8. Allow asymmetric allocatable and pointer objects

Suggested by: Bill Long

Allow asymmetric allocatable and pointer objects, declared with
deferred shape and explicit coshape, e.g.
   REAL, ALLOCATABLE :: A(:)[*]

This provides a mechanism for avoiding the artificial structure
workaround and gives users a way to create coarrays that are
restricted to a team. The downside is that you cannot call this thing
an "allocatable coarray" without having significant side effects
elsewhere in the standard. [This was a major reason the idea was
dropped previously.] Basically, the object is an orphaned component,
but there are no terms for that either.

-------------------------------------------------------------------------
Proposal 9. Suggestions for changes to the NOTIFY and QUERY statements

Suggested by: Reinhold Bader

Introduction
~~~~~~~~~~~~

In his critique of the coarray features in the Fortran 2008 draft
(J3/08-126), John Mellor-Crummey et al. specifically mention issues
with the NOTIFY and QUERY statements. This paper attempts to introduce
changes to the feature which remove these issues.

1. Properties of image control statements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both NOTIFY and QUERY are image control statements, but there are
circumstances under which execution of these statements should not
include the effect of a SYNC MEMORY statement:
 * execution of a NOTIFY should not have the effect of a SYNC MEMORY
   statement. As with LOCK and UNLOCK, a one-way ordering of the
   segments with respect to the target image executing the
   corresponding QUERY should be sufficient;
 * execution of a non-blocking QUERY statement for which the resulting
   READY value is .FALSE. should have no influence on segment
   ordering.

2. Notification Events
~~~~~~~~~~~~~~~~~~~~~~

The number of invocations N(M --> T) and Q(M <-- T) is not part of the
global state of the program, but always refers to a notification
event: N(M --> T, E), Q(M <-- T, E). Such an event is an entity of a
derived type EVENT_TYPE defined in the ISO_FORTRAN_ENV intrinsic
module, and such an entity - similar to a lock - must always be a
coarray (or, if coscalars make it into the TR, a coscalar).
The programmer must declare an event and use it as an argument to both
NOTIFY and QUERY, thereby ensuring that existing notifications do not
interfere with notifications and queries in library code, which would
use distinct events. Hence, the example from 10-166, NOTE 2.5, could
be modified as follows:

   SUBROUTINE PROCESS(...)
      ... ! declarations
      TYPE(EVENT_TYPE), SAVE :: PROCESS_EVENT[]
      IF (THIS_IMAGE()==1) THEN
         DO I=1,100
            ... ! Primary processing of column I
            NOTIFY(2, EVENT=PROCESS_EVENT) ! Done with column I
         END DO
         SYNC IMAGES(2)
      ELSE IF (THIS_IMAGE()==2) THEN
         DO I=1,100
            QUERY(1, EVENT=PROCESS_EVENT) ! Wait until image 1 is done with column I
            ... ! Secondary processing of column I
         END DO
         SYNC IMAGES(1)
      END IF
   END SUBROUTINE PROCESS

3. Excess notifications
~~~~~~~~~~~~~~~~~~~~~~~

The excess of notifications over queries for a given event and a given
pair of images should be limited to one. That is, while a program may
complete with an excess of notifications, it would be disallowed to
issue a new N(M-->T,E) on an event while the corresponding query is
still outstanding. Any situation where subsequent NOTIFY statements
(without interleaved queries) are required on the same image pair can
be handled by introducing multiple events, typically responsible for
protecting different coarray entities from unsynchronized access.

4. Using team arguments
~~~~~~~~~~~~~~~~~~~~~~~

For conciseness (and if teams make it into the TR), it should also be
allowed to use arguments of type IMAGE_TEAM instead of the image set
in NOTIFY and QUERY statements.

5. Some final remarks
~~~~~~~~~~~~~~~~~~~~~

The NOTIFY and QUERY statements provide a more general load-balancing
synchronization facility than the corresponding UPC construct. In UPC,
upc_notify and upc_wait are always collective; to avoid deadlocks it
is not allowed to start a new notification while a previous one is
still open.
In Fortran, apart from the possibility of performing NOTIFY and QUERY
for arbitrary subsets of images, it is also possible to construct a
split-phase barrier by having a subset of images execute NOTIFY and
QUERY with that subset as the image-set argument. By using different
event variables, new notifications may be started before previous ones
have completed, without incurring deadlocks. In particular, using a
split-phase barrier together with collective functions may provide
improved performance if the collectives do not enforce synchronization
at entry.
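Such a split-phase barrier might be sketched as follows (proposed
syntax, not standard Fortran; the image-set form of NOTIFY and QUERY
and the name BARRIER_EVENT are illustrative assumptions, with all
images 1..P of the subset executing the same code):

   TYPE(EVENT_TYPE), SAVE :: BARRIER_EVENT[]
   ...
   NOTIFY((/ (I, I=1,P) /), EVENT=BARRIER_EVENT) ! signal my arrival to the subset
   ... ! independent local work, overlapped with the barrier
   QUERY((/ (I, I=1,P) /), EVENT=BARRIER_EVENT)  ! wait for arrival of all partners

Work placed between the NOTIFY and the QUERY hides barrier latency, in
the same spirit as the upc_notify/upc_wait pair in UPC.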