ISO/IEC JTC1/SC22/WG5 N1835

Requirements for TR of further coarray features
John Reid
10 September 2010

Resolution LV5 of the WG5 meeting in Las Vegas, Feb. 2008 reads:

   "LV5. Content and processing of TR on Enhanced Coarray Facilities
    That WG5 declares that the content of the Technical Report on
    Enhanced Coarray Facilities in Fortran is as shown in document
    J3/08-131r1. Further, WG5 expects the TR to be published during the
    second quarter of 2011."

At the WG5 meeting in June 2010, the target date for publication was changed
to November 2012 (see resolution LV5 and document N1812), but there has been
no discussion since 2008 of the technical content. The aim of this paper is
to explore alternatives to the features of J3/08-131r1, which are:

1) Collective intrinsic subroutines:
      CO_ALL CO_ANY CO_COUNT CO_MAXLOC CO_MAXVAL CO_MINLOC CO_MINVAL
      CO_PRODUCT CO_SUM
2) Teams and features that require teams:
      Team formation and inquiry: the FORM_TEAM and TEAM_IMAGES intrinsics,
      and the IMAGE_TEAM type.
      SYNC TEAM statement
      TEAM specifiers in I/O statements
3) The NOTIFY and QUERY statements.
4) A file connected on more than one image, except for the files preconnected
   to the units specified by OUTPUT_UNIT and ERROR_UNIT.

A draft TR containing exactly these features is visible as J3/10-166, and
this paper will use J3/10-166 as its base document. I hope that the technical
content of the TR can be decided at the WG5 meeting in June 2011.

This paper is arranged as a set of proposals, each with a summary and (where
appropriate) its technical details. The titles are

Proposal 1. Replace the set of collectives
Proposal 2. Add atomic compare-and-swap and other atomic subroutines
Proposal 3. Proposals from Bob Numrich with comments
Proposal 4. Add coscalars
Proposal 5. Allow asynchronous execution of the collectives
Proposal 6. Reconsider the handling of coarrays in a team
Proposal 7. Add coarray pointers
Proposal 8. Allow asymmetric allocatable and pointer objects
Proposal 9.
Suggestions for changes to the NOTIFY and QUERY statements

---------------------------------------------------------------------------
Proposal 1. Replace the set of collectives

Suggested by: Bill Long.

Summary

Replace the collective intrinsic subroutines by the new set
   CO_BCAST CO_MAX CO_MIN CO_REDUCE CO_SUM
which are not image control statements.

The current collective subroutine CO_SUM lacks some features that would
improve its usability and its performance. A new version is desired with
these enhancements:

1) The RESULT argument should be optional. If it is not present, the result
of the computation is assigned to the SOURCE argument. Rationale: the current
specification requires declaring a second variable to be used for the RESULT,
which is often unnecessary.

2) SOURCE, and RESULT if present, should be allowed to be non-coarrays.
Rationale: this significantly expands the potential usability of the routine,
particularly in the context of integrating coarrays into existing codes.
Internally, the routine could have a coarray of a derived type with a
component that is a pointer to the supplied SOURCE or RESULT argument, and
perform the computation using that structure. As an optimization for the case
of a scalar argument, the routine could have an internal coarray into which
the source is copied at entry and from which the result is copied at
completion.

3) Add a new optional argument, RESULT_IMAGE. If this is present, the result
is assigned only on the identified image and is not broadcast to all the
images. On all other images, the result variable becomes undefined.
Rationale: this is a reasonably common usage, and eliminating the broadcast
improves performance.

The designs for CO_MAX and CO_MIN follow that of CO_SUM, and much of the same
infrastructure can be reused. CO_MAX and CO_MIN differ from CO_SUM in that
they allow arguments of type character and do not allow arguments of type
complex.
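The interplay of the optional RESULT and RESULT_IMAGE arguments can be
sketched in a single-process model. The following Python sketch is purely
illustrative (the proposed intrinsic does not yet exist anywhere); the
function name co_sum_model and the list-of-images framing are assumptions of
the sketch, not part of the proposal:

```python
def co_sum_model(images, result_image=None):
    """Toy model of the proposed CO_SUM: images[i] plays the role of
    SOURCE on image i+1.  Returns the per-image result variables;
    None stands in for a variable that becomes undefined."""
    total = [sum(vals) for vals in zip(*images)]   # elementwise sum across images
    if result_image is None:
        return [list(total) for _ in images]       # result broadcast to every image
    # With RESULT_IMAGE present, only that image receives the sum; the
    # result variable on every other image becomes undefined.
    return [list(total) if i + 1 == result_image else None
            for i in range(len(images))]

# Two images holding [1, 5, 3] and [4, 1, 6]:
print(co_sum_model([[1, 5, 3], [4, 1, 6]]))                  # [[5, 6, 9], [5, 6, 9]]
print(co_sum_model([[1, 5, 3], [4, 1, 6]], result_image=2))  # [None, [5, 6, 9]]
```

The second call shows the performance-motivated case: the reduction is
computed once and nothing is broadcast back.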
The new intrinsic subroutine
   CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])
would broadcast a value to all images of a team. The new intrinsic subroutine
   CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE])
would provide a general routine for operations not currently covered. This
subroutine could also handle arguments of derived type, as long as the
specified operation is defined for the type. The specification follows that
for CO_SUM, with the addition of an OPERATION argument.

Technical details

CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])

Description. Broadcast a value to all images of a team.

Class. Collective subroutine.

Arguments.
SOURCE shall be a coarray. It is an INTENT(INOUT) argument. SOURCE becomes
defined on all images of the team with the value of SOURCE on image
SOURCE_IMAGE.
SOURCE_IMAGE shall be of type integer. It is an INTENT(IN) argument. Its
value shall be the image number of one of the images in the team.
TEAM (optional) shall be a scalar of type IMAGE_TEAM. It is an INTENT(IN)
argument that specifies the team for which the broadcast is performed. If
TEAM is not present, the team consists of all images.

Example. If SOURCE is the array [1, 5, 3] on image one, after execution of
   CALL CO_BCAST(SOURCE, 1)
the value of SOURCE on all images is [1, 5, 3].

........................................................

CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE])

Description. General reduction of elements on a team of images.

Class. Collective subroutine.

Arguments.
SOURCE shall be of a type for which the operation specified by the OPERATION
argument is defined. It is an INTENT(INOUT) argument. It may be a scalar or
an array. If it is a scalar, the computation result is equal to a
processor-dependent and image-dependent approximation to the application of
the operation specified by the OPERATION argument to the values of SOURCE on
all images of the team.
If it is an array, the value of the computation result is equal to a
processor-dependent and image-dependent approximation to the application of
the operation specified by the OPERATION argument to all the corresponding
elements of SOURCE on the images of the team. If RESULT is not present, the
value of the computation result is assigned to SOURCE. If RESULT is present,
SOURCE is not modified.
OPERATION shall be an external procedure that defines the binary, commutative
operation to be performed. The specified procedure shall have two scalar
arguments of the same type and type parameters as SOURCE, and return a result
of the same type and type parameters as SOURCE. The result of executing the
procedure is the value of performing the intended operation with the two
arguments as operands.
RESULT (optional) shall be of the same type, type parameters, and shape as
SOURCE. It is an INTENT(OUT) argument. If RESULT is present, the value of the
computation result is assigned to RESULT.
TEAM (optional) shall be a scalar of type IMAGE_TEAM (4.4.2). It is an
INTENT(IN) argument that specifies the team for which CO_REDUCE is performed.
If TEAM is not present, the team consists of all images.
RESULT_IMAGE (optional) shall be of type integer. It is an INTENT(IN)
argument. Its value shall be the image number of one of the images in the
team. If RESULT_IMAGE is present and RESULT is present, the result of the
computation is assigned to RESULT on image RESULT_IMAGE and RESULT on all
other images becomes undefined. If RESULT_IMAGE is present and RESULT is not
present, the result of the computation is assigned to SOURCE on image
RESULT_IMAGE and SOURCE on all other images becomes undefined.

Example. If the number of images is two, SOURCE is the array [1, 5, 3] on one
image and [4, 1, 6] on the other image, and MyADD is a function that returns
the sum of its two integer arguments, the value of RESULT after executing the
statement
   CALL CO_REDUCE(SOURCE, MyADD, RESULT)
is [5, 6, 9].
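The element-by-element application of a user-supplied OPERATION can be
mimicked with Python's functools.reduce. This is only a sequential sketch of
the semantics just specified; co_reduce_model and my_add are illustrative
stand-ins for the intrinsic and for the MyADD function of the example:

```python
from functools import reduce

def co_reduce_model(images, operation):
    """Sequential model of CO_REDUCE for an array SOURCE: apply the binary,
    commutative OPERATION across the corresponding elements of SOURCE on
    every image.  A real implementation may associate the operands in any
    order, which is why the TR text only promises a processor-dependent
    approximation when OPERATION is not exactly associative."""
    return [reduce(operation, elems) for elems in zip(*images)]

def my_add(a, b):          # plays the role of MyADD in the example above
    return a + b

print(co_reduce_model([[1, 5, 3], [4, 1, 6]], my_add))  # [5, 6, 9]
print(co_reduce_model([[1, 5, 3], [4, 1, 6]], max))     # [4, 5, 6]
```

The second call illustrates why CO_REDUCE subsumes routines such as CO_MAX:
only the operation changes, not the communication pattern.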
.................................................

CO_SUM (SOURCE [, RESULT, TEAM, RESULT_IMAGE])

Description. Sum elements on a team of images.

Class. Collective subroutine.

Arguments.
SOURCE shall be of numeric type. It is an INTENT(INOUT) argument. It may be a
scalar or an array. If it is a scalar, the computation result is equal to a
processor-dependent and image-dependent approximation to the sum of the
values of SOURCE on all images of the team. If it is an array, the value of
the computation result is equal to a processor-dependent and image-dependent
approximation to the sum of all the corresponding elements of SOURCE on the
images of the team. If RESULT is not present, the value of the computation
result is assigned to SOURCE. If RESULT is present, SOURCE is not modified.
RESULT (optional) shall be of the same type, type parameters, and shape as
SOURCE. It is an INTENT(OUT) argument. If RESULT is present, the value of the
computation result is assigned to RESULT.
TEAM (optional) shall be a scalar of type IMAGE_TEAM (4.4.2). It is an
INTENT(IN) argument that specifies the team for which CO_SUM is performed. If
TEAM is not present, the team consists of all images.
RESULT_IMAGE (optional) shall be of type integer. It is an INTENT(IN)
argument. Its value shall be the image number of one of the images in the
team. If RESULT_IMAGE is present and RESULT is present, the result of the
computation is assigned to RESULT on image RESULT_IMAGE and RESULT on all
other images becomes undefined. If RESULT_IMAGE is present and RESULT is not
present, the result of the computation is assigned to SOURCE on image
RESULT_IMAGE and SOURCE on all other images becomes undefined.

Example. If the number of images is two and SOURCE is the array [1, 5, 3] on
one image and [4, 1, 6] on the other image, the value of RESULT after
executing the statement
   CALL CO_SUM(SOURCE, RESULT)
is [5, 6, 9].

-------------------------------------------------------------------------
Proposal 2.
Add atomic compare-and-swap and other atomic subroutines

Suggested by: Bill Long.

Summary

Several people external to WG5, and several members of the committee, have
proposed that in addition to ATOMIC_DEFINE and ATOMIC_REF it would be very
useful to add an atomic read-modify-write intrinsic. The most basic of these,
in both theoretical work and practical implementations, is the atomic
compare-and-swap (CAS) intrinsic. The basic operation of this intrinsic is
   atomic_cas (atom, old, compare, new)
which performs atomically:
   old = atom
   if (old == compare) atom = new

The following further atomic subroutines are suggested:
   atomic_add  atomic_fadd
   atomic_and  atomic_fand
   atomic_or   atomic_for
   atomic_xor  atomic_fxor
where the 'f' versions are the "fetch_and_" versions of the ones without the
'f'. These have existed (with different spellings) in the Cray coarray
implementation from the beginning because of specific customer demands. All
take integer arguments. Having standardized and portable names would be good.

Technical details

ATOMIC_CAS (ATOM, OLD, COMPARE, NEW)

Description. Conditionally swap values atomically.

Class. Atomic subroutine.

Arguments.
ATOM shall be scalar and of type integer with kind ATOMIC_INT_KIND or of type
logical with kind ATOMIC_LOGICAL_KIND, where ATOMIC_INT_KIND and
ATOMIC_LOGICAL_KIND are the named constants in the intrinsic module
ISO_FORTRAN_ENV. It is an INTENT(INOUT) argument. If the value of ATOM is
equal to the value of COMPARE, ATOM becomes defined with the value of
INT (NEW, ATOMIC_INT_KIND) if it is of type integer, and with the value of
NEW if it is of type logical.
OLD shall be scalar and of the same type as ATOM. It is an INTENT(OUT)
argument. It becomes defined with the value of INT (ATOMC, KIND (OLD)) if
ATOM is of type integer, and the value of ATOMC if ATOM is of type logical,
where ATOMC has the same type and kind as ATOM and has the value of ATOM used
for the compare operation.
COMPARE shall be scalar and of the same type and kind as ATOM. It is an
INTENT(IN) argument.
NEW shall be scalar and of the same type as ATOM. It is an INTENT(IN)
argument.

Example.
   CALL ATOMIC_CAS(I[3], OLD, Z, 1)
causes I on image 3 to become defined with the value 1 if its value is that
of Z, and OLD to become defined with the value of I on image 3 prior to the
comparison.

------------------------------------------------------------------------
Proposal 3. Proposals from Bob Numrich with comments

Suggested by: Bob Numrich

a. The intrinsic function this_image()

The function this_image should allow a scalar return value for coarray
arguments with just one codimension:
   integer :: me
   real :: x[*]
   me = this_image(x)
Internally the function may continue to think it is returning an array of
length one, but the programmer should not be penalized for that. Let the
value on the left side of the assignment statement be a scalar. At most,
issue a warning at compile time. I hit this problem every time I write new
code. It is embarrassing trying to explain it to a new coarray programmer.

b. Remove restrictions on derived types with coarray components

Remove most, if not all, of the restrictions on derived types with coarray
components. I can't remember all the restrictions, but I think there are lots
of them. For those restrictions that absolutely cannot be removed, provide a
clear explanation of why. In particular, remove the restriction that a child
type can add a coarray component only if its parent has a coarray component.
It messes up inheritance by, for example, forcing every abstract type to
contain a dummy coarray component just in case somebody wants to extend it by
adding a coarray component, which will often be the reason it is being
extended.

c. Alternative sync statement

There only needs to be one sync() statement with different arguments:
   integer :: list(:)
   sync() or sync   ! sync with all images; behaves just like sync all
   sync(list)       !
sync with images in list(:); behaves like sync images(list)
   sync(memory)     ! local memory sync; behaves like sync memory
Existing sync statements remain valid if the programmer wants to use them.
See proposal f below for team sync.

d. Functions with side effects

Programmers may write functions with side effects, such as internal syncs or
allocation of coarrays, but they have no guarantee that the functions will be
executed in the order in which they are listed in the program statement, or
executed at all. The ordering of segments assumed by the programmer is
therefore broken. Functions with this kind of side effect need a new
attribute that requires them to be executed, and executed in the order
written in the program statement. The attribute IMPURE would be a natural
choice had it not already been used to mean something else.

Functions should be allowed to return objects with coarray components.
Constructors require this capability. A workaround using an overloaded
assignment statement is very awkward and frankly embarrassing.

e. Collectives

Collectives should be part of a support library, not part of the language. If
we make them part of the language, we need to be very careful how we define
them. The UPC people have been arguing about them for years. Duplicating the
MPI collectives should not be the goal. If the coarray model is compatible
with MPI, why not just use the MPI collectives?

If we do include them, collectives should be functions (with side effects as
in proposal d):
   s = co_sum(x)
They should be simple, mimicking the normal functions:
   s = sum(x)
No long list of arguments, please. See proposal f for collectives within
teams. Every image must invoke the function, and every image gets the result
on return from the function. The argument x need not be a coarray. Since they
are collective, they imply a segment boundary upon entry. Each image can
return from the function independently as soon as it receives the value of
the result.
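The execution model just described — every image calls the function, a
segment boundary is implied on entry, and each image returns independently as
soon as the value is available — can be imitated with threads standing in for
images. This is a hypothetical sketch of the idea only, not TR text; all
names are invented:

```python
import threading

NUM_IMAGES = 4
entry = threading.Barrier(NUM_IMAGES)     # models the segment boundary on entry
state_lock = threading.Lock()
state = {"sum": 0, "contributions": 0}
ready = threading.Event()
results = [None] * NUM_IMAGES

def co_sum(x):
    """Function-style collective: invoked by every image, and every image
    gets the reduced value back as the function result."""
    entry.wait()                          # all images reach the collective
    with state_lock:
        state["sum"] += x
        state["contributions"] += 1
        if state["contributions"] == NUM_IMAGES:
            ready.set()                   # last contributor publishes the result
    ready.wait()                          # return as soon as the value is known
    return state["sum"]

def image(i):
    results[i] = co_sum(i + 1)            # image i contributes the value i+1

threads = [threading.Thread(target=image, args=(i,)) for i in range(NUM_IMAGES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                            # every image sees 1+2+3+4 = 10
```

Note that the images do not wait at a second barrier: each returns as soon as
the result is published, matching the "return independently" point above.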
I would like to see some evidence that nonblocking collectives really make a
difference in the overall performance of a real application, not just a
kernel.

f. Teams

Remove teams completely from the proposed extensions. The ability to couple
two coarray codes already exists using MPI intercommunicators or one of the
many frameworks out there. These frameworks just need to allow coarray codes
as components. Teams are a very big addition to the language, and we should
hasten slowly.

A coarray code should be the same whether it is run alone or run as a team
coupled with another coarray code. With the current definition of teams this
is not true: both codes will need to be altered to run as teams.
Dereferencing codimensions relative to a team is a very big problem. Leaving
it up to the programmer is very, very difficult and very error prone. All
coarray references will need to be changed so that they are relative to the
team. Symmetric memory will be broken if we allow allocation within teams. We
should not do teams.

If we go ahead with teams, somebody has to figure out how codimensions are
dereferenced relative to a team. Otherwise, teams are pretty useless. If a
coarray is declared in code only executed by a team, is the coarray visible
across all images or just within the team? If we allow allocation within a
team, the allocate should be a method associated with the team object:
   Type(Team_Object) :: myTeam
   real, allocatable :: x[:,:]
   stat = myTeam%allocate(x[p,*])
This allocate implies a sync within the team. A coarray allocated this way
could be given a state that includes information on how to dereference
codimensions. Do image indices then start with one, or start with the first
image in the team? How do we deal with asymmetric heaps? Are coarrays
allocated by one team visible to other teams? How?
Collective functions within a team should be associated with team objects:
   Type(Team_Object) :: myTeam
   s = myTeam%sum(x)
Synchronization within a team should be a procedure associated with the team
object:
   Type(Team_Object) :: myTeam
   myTeam%sync()       ! sync with images in myTeam
   myTeam%sync(list)   ! sync with a subset of images in myTeam
Other associated functions:
   myTeam%isMyTeam()
   myTeam%myTeamIndex()
   myTeam%teamList()
But I repeat, we should not add teams.

g. Notify/Query

We should hasten slowly with these statements. The current definition is
probably wrong. There probably needs to be some sort of tag associated with
these statements, making them look more like events.

h. Locks

Add some logical functions associated with lock variables:
   type(Lock_Object) :: lck[*]
   if (lck%isMyLock()) then
      unlock(lck)
   end if
Otherwise, all images must attempt to unlock the variable and deal with an
error code. In the same way, the function isLocked() determines whether a
lock is already locked:
   if (.not. lck%isLocked()) then
      lock(lck)
   end if
This function also allows spinning on a lock until it is free.

i. MPI, OpenMP, UPC, CUDA compatibility

Is the coarray model compatible with other programming models?

-------------------------------------------------------------------------
Proposal 4. Add coscalars

Suggested by: Reinhold Bader

Summary

Add "coscalars". A coscalar exists on a single image and is referenced by
appending []. It may be a scalar or an array. It may be a pointer,
allocatable, or neither. When a pointer or allocatable coscalar is allocated,
the programmer can choose the host image; otherwise, the host image is
processor dependent and does not change during the lifetime of the coscalar.
The main application is to program-wide linked lists that are modified rarely
but accessed frequently. There are also advantages in having coscalar locks.

Technical details

1.
Introduction:
~~~~~~~~~~~~~~~~

The intent of this paper is to bring forward arguments in favour of including
unsymmetric shared entities (denoted "coscalars") in the Technical Report on
Enhanced Parallel Computing Facilities. An informal description of the
desired features is provided, which attempts to obey the following
constraints:

* Coscalar functionality is kept as orthogonal as possible to coarrays.
  Having few interactions between the two features should minimize the
  implementation effort.
* Coscalar syntax and semantics follow the design principles for coarrays as
  far as possible, with respect to visual indication of communication and
  with respect to synchronization semantics.

A key feature is the possibility of allowing subobjects of a derived type
entity to be hosted on an image different from that hosting the parent
object. This is achieved by introducing coscalar pointers to coscalars; see
sections 5 and 6.1 for details. The suggested language elements are analogous
to shared scalar entities and shared pointers to shared entities in UPC, but
with additional provisions and restrictions to increase safety of use, as
well as suggestions for performance tuning.

2. Complex data structures and their scalability limitations:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main scenario targeted by the feature described in this paper is the case
of general, program-wide data structures like lists, deques, binary trees,
oct-trees, etc., which are modified once or only rarely but traversed and
referenced often throughout the execution of the program. For example, MPI
codes which need to perform dynamic load balancing during program execution
typically implement this kind of concept manually (and with great programming
effort).
While it is possible to implement such concepts using the coarray facilities
defined in the base language (for example, by using allocatable components of
a coarray of suitable derived type), this is still significantly more complex
to program and maintain than, e.g., an OpenMP tasking code. Furthermore, it
may require repeated reallocation of coarrays if the size of the data
structure is not known a priori, thereby incurring repeated program-wide
synchronization and hence scalability issues.

For the scenario indicated above, the arguments against the use of shared
pointers become less relevant since

* the workload should typically be considerably larger than the latency for
  accessing a shared pointer, and
* the double latency incurred for dereferencing a shared pointer as well as
  accessing its target may be reduced by caching the descriptor information,
  during synchronization phases, on all images requiring it.

3. Coscalar declaration, definition and reference:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar entity is declared in one of the following ways:

   real, codimension[] :: cs1       ! real scalar coscalar
   real :: cs2[]                    ! alternative declaration
   real, target :: ca1(ndim)[]      ! real array coscalar

Syntactically, the difference between a coscalar and a coarray is that a
coscalar does not specify a coshape; the corresponding semantics is that the
declared entity is shared between all images. Notwithstanding, there is one
image which is considered the "hosting image" of the coscalar. The hosting
image can be identified via the image_index() intrinsic, thereby allowing the
programmer to tune code for efficiency of access. For a statically declared
coscalar the hosting image is processor dependent; it is the same image
throughout the coscalar's existence.
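To make the hosting-image concept concrete, here is a toy single-process
model. The class and its method names are inventions for illustration only;
they mirror the image_index() inquiry and shared access loosely, and carry
none of the proposal's synchronization semantics:

```python
class Coscalar:
    """One shared value with a single hosting image.  All images see the
    same entity; the host matters only for access latency and for who may
    allocate or deallocate it."""
    def __init__(self, hosting_image, value=None):
        self._host = hosting_image     # processor-chosen for static coscalars
        self._value = value
    def image_index(self):             # models the image_index() inquiry
        return self._host
    def get(self):                     # models a reference to the shared value
        return self._value
    def set(self, value):              # models a definition of the shared value
        self._value = value

cs1 = Coscalar(hosting_image=1)        # host chosen by the "processor"
cs1.set(42.0)                          # any image may define it (with synchronization)
print(cs1.get(), cs1.image_index())    # 42.0 1
```

The point of the model is that there is exactly one value, not one per image
as with a coarray, and that its host is a queryable but fixed property.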
Every definition or reference of a coscalar requires the square brackets to
be specified:

   x = cs1[]
   cs2[] = a[4]

This maintains the visual indication of the occurrence of communication
already known from coarrays; for the purpose of notation it is assumed that
it is the exception rather than the rule that the hosting image references or
defines a coscalar. On the other hand, an implementation may generate
multiple code versions depending on whether the access is or is not local,
thereby assuring improved speed for local accesses.

The sequence of definitions and references of coscalars follows the same
synchronization rules as for coarrays. For example, in

   if (this_image() == 1) then
      cs1[] = ...
      sync images (*)
   else
      sync images (1)
      x = cs1[]
   end if

the SYNC IMAGES statements are required to prevent a data race between the
single image which defines cs1 and all the others which reference it.

4. Allocatable coscalars:
~~~~~~~~~~~~~~~~~~~~~~~~~

To enable control of memory locality by the programmer, a coscalar with the
allocatable attribute can be allocated on a specific image:

   real, allocatable, codimension[] :: cs3
   :
   allocate(cs3, image=4)

where the IMAGE argument to the ALLOCATE statement is obligatory; on images
other than the one specified, the statement has no effect (and it is of
course a violation of the synchronization rules if two images attempt to
allocate an unallocated entity in unordered segments). A call to
this_image(cs3) or allocated(cs3) on any image, in a segment executed after
the one in which the allocation is performed, will return the values 4 and
.TRUE., respectively.

It is required that the hosting image perform the deallocation:

   deallocate(cs3)

Executing this statement on images other than that hosting the coscalar has
no effect. If applied to coscalars (and local entities) only, neither the
ALLOCATE nor the DEALLOCATE statement performs any synchronization.
This improves scalability, especially if only small subsets of images (or
only teams) need to access the coscalar. The ALLOCATED intrinsic may also be
used on images which do not host the coscalar; it is atomic in the sense that
it may be executed in a segment unordered with respect to the one performing
the allocation or deallocation. However, it is the programmer's
responsibility to deal properly with race conditions which may result from
such a use, especially in the case of deallocation.

5. Pointers to coscalars:
~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar pointer to a shared entity is declared by specifying the pointer
attribute for a coscalar:

   real, pointer :: cp(:)[]
   type(team_array), pointer, codimension[] :: tp(:)

Such an entity is itself a coscalar (with a processor-dependent hosting image
unless it is a type component), and it may be pointer associated with a
shared entity with the target attribute:

   if (this_image() == R) cp[] => ca1(:)

Since image R may be distinct from both the image hosting the coscalar
pointer and the image hosting its target, the above pointer assignment
statement may involve up to three images; note that a similar situation can
also occur when using regular assignment with differently coindexed objects
on both sides of the assignment. Image control statements to perform
synchronization prior to subsequent references or definitions are required
only against the image R executing the pointer assignment. Also, it would be
allowed to define the target in segments unordered with respect to the one
executing the above pointer assignment, since only transfer of a descriptor
is required (for this reason the square brackets are omitted from the right
hand side); in this case synchronization would need to include the image
performing such a definition. Subsequent references to cp[] then go to the
target:

   x(3) = cp(3)[]   ! same as x(3) = ca1(3)[]

(One could also consider allowing coindexed objects as targets.)
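The descriptor-only nature of pointer association can be modeled as follows.
This is a sketch under the assumption that a descriptor is just a handle to
the target; the class and method names are invented for illustration:

```python
class CoscalarPointer:
    """Model of a coscalar pointer: association stores only a descriptor
    (here simply a Python reference), and element references are forwarded
    to the target, wherever it is hosted — the two-hop access whose latency
    the caching discussed in section 7 would hide."""
    def __init__(self):
        self._descriptor = None        # disassociated, like NULL()
    def associate(self, target):       # models  cp[] => ca1(:)
        self._descriptor = target      # only the descriptor is transferred
    def associated(self):              # models the ASSOCIATED() inquiry
        return self._descriptor is not None
    def get(self, i):                  # models  x(3) = cp(3)[]
        return self._descriptor[i - 1] # 1-based indexing, as in Fortran

ca1 = [10.0, 20.0, 30.0]               # plays the role of the target ca1
cp = CoscalarPointer()
cp.associate(ca1)                      # cheap: no element data moves
print(cp.associated(), cp.get(3))      # True 30.0
```

Note that associate() never touches the target's elements, which is why the
proposal permits it in segments unordered with target definitions.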
It is possible to dynamically allocate the target on a single image:

   allocate(tp(num_images()), image=1)

As with allocatable entities, the later deallocation must be performed on the
hosting image. The NULL() and NULLIFY() intrinsics are also available for
coscalar pointers; these may also be used on images which do not host the
pointer or its target, but if they do so they must be called in a segment
ordered with respect to any segment changing the association or definition
status of the pointer. The ASSOCIATED() intrinsic, like ALLOCATED(), is
atomic; in its two-argument form, both arguments must be coscalars if one is.
The target's hosting image may be determined by using the IMAGE_INDEX()
intrinsic on the coscalar pointer associated with it. Finally, just as for
regular pointers, it is possible to specify the CONTIGUOUS attribute for a
coscalar pointer, in which case its target must be simply contiguous.

6. Derived types:
~~~~~~~~~~~~~~~~~

The desired properties of shared general data structures rest on the
possibility of defining coscalar subobjects which may be hosted on an image
different from that hosting the parent data object. A number of restrictions
are required to ensure that no remote allocations or deallocations are needed
wherever dynamic type components are involved. Combinations of coscalars and
coarrays are disallowed in the derived-type context, i.e., a coarray may not
have coscalar type components, and a coscalar may not have coarray type
components.

6.1 Distributed structures
~~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar may appear as a type component provided it has the POINTER
attribute. This allows for a directory-like programming style:

   type :: team_array
      real, pointer, contiguous :: x(:)[]
   end type
   type(team_array), allocatable :: o(:)[]

   allocate(o(num_images()), image=myteam_first_image)
   sync team (myteam)
   if (member_of(myteam)) allocate(o(this_image())%x(localsize), &
                                   image=this_image())
   sync team (myteam)   !
synchronize across team only

After the SYNC TEAM, an image in the team may now define

   o(any_other_team_index)[]%x(:)[] = ...

where the subobject is hosted on image any_other_team_index (which may be an
image other than that allocating the entity o). The image hosting the
coscalar pointer component (not its target!) is the image hosting the parent
object. For simplicity of implementation, it is suggested that the parent
object of such a type be required to be a coscalar.

6.2 Local (dynamic) type components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Given a type definition with (regular) pointer or allocatable components, it
is possible to declare coscalars of such a type; any subobject of such an
entity is also a coscalar, and its allocation or association status may only
be changed on the image hosting its parent object:

   type :: ptr_type
      real, pointer :: x(:)
   end type
   type(ptr_type) :: o[]
   real, target :: y(5)

   if (image_index(o) == this_image()) then
      y = ...
      o[]%x => y
   end if
   sync all
   y = o[]%x   ! scatter

For pointer components this ensures that pointer association with a local
object is well defined.

   type :: alloc_type
      real, allocatable :: x(:)
   end type
   type(alloc_type) :: a[]

   allocate(a%x(5), image=image_index(a))   ! a%x is a coscalar
   sync all
   if (this_image() == 1) then
      a[]%x(:) = ...
   end if

For allocatable components this ensures that no remote deallocation is
required when the object goes out of scope.

6.3 Polymorphic entities/subobjects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The case of either polymorphic coscalars or polymorphic components of
coscalars will presumably require restrictions similar to those in the
coarray case. The details of these will need to be worked out.

7. Tuning considerations
~~~~~~~~~~~~~~~~~~~~~~~~

The envisaged main usage scenario for coscalars is the situation where an
object is generated once and then referenced multiple times on some or all
images (one or few writes, many reads).
This section contains some thoughts on how tuning of code using coscalars
could be performed.

7.1 Caching
~~~~~~~~~~~

While a coscalar always has a hosting image, an implementation may choose to
cache the entity, or parts of it, on (some) other images. If so, some garbage
collection scheme will be required to dispose of the cached copies once the
coscalar is deallocated or goes out of scope. The manner in which caching is
performed will be processor dependent, but the expectation is that a
high-quality implementation will perform caching

* on coscalar pointers, to reduce access latency, and
* on sufficiently small items.

One could consider providing an additional collective intrinsic to enforce
caching. Otherwise, implementation-dependent caching would be controlled by
execution of an image control statement. Finally, especially for the case in
which locality control is exerted by the programmer (see below), or if it is
known that an entity requires a large amount of memory (perhaps only
available to a subset of images), an attribute can be specified to suppress
caching:

   type(alloc_type), uncached :: a[]

7.2 Locality control
~~~~~~~~~~~~~~~~~~~~

As a convenience for the implementation of load-balancing algorithms, a
statement for changing the hosting image of an allocatable coscalar or a
coscalar pointer target is provided:

   relocate (a, image=4 [, team=...] [, sync='YES|NO'])

would change the hosting image of a to 4. This statement must be executed
collectively by all images (of a team). If a team argument is present, the
specified image as well as the image hosting the entity to be relocated must
be members of the team. The statement implies synchronization of all images
executing it unless the SYNC argument is specified with the value 'NO'; in
that case, it is the programmer's responsibility to insert synchronization
statements before subsequent references or definitions of the entity.
Relocation applies only to the parent object in case the object has
coscalar pointer components; the latter's targets remain on their
hosting images. Regular pointer components of a relocated coscalar
become undefined, and execution of a relocate statement may in this
case induce a memory leak. An allocatable component of a relocated
coscalar is reallocated on the new hosting image, and its contents are
transferred to the relocated entity.

7.3 Note on symmetric heaps
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many cases, the intent for distributed structures is to achieve a
balanced filling of memory across images. Hence, an implementation
might be able to use a symmetric heap even in this case, allocating it
in moderately large blocks, with an additional level of indirection
for accessing the data items in the structure. For such
implementations, a compiler directive allowing the programmer to
indicate a suitable block size might be useful.

8. Coscalars and subprograms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8.1 subprogram-local coscalars
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to coarrays, coscalars declared inside subprograms are
required to have the SAVE attribute, and automatic (array) coscalars
are not allowed, since the need to have a well-defined hosting image
would imply a need for synchronization. However, coscalars may be
declared locally in a subprogram without SAVE if they are allocatable
or have the POINTER attribute; it is then the programmer's
responsibility to ensure valid accesses by performing allocation and
deallocation and by inserting image control statements. An allocatable
coscalar is automatically deallocated once the image hosting it
completes execution of the subprogram.

8.2 coscalar dummy arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A subprogram dummy argument may be a coscalar, in which case the
actual argument must be a coscalar (the latter is then provided
without the square brackets).
Similar to the coarray case, restrictions are in place that ensure no
copy-in/out occurs. The dummy argument's hosting image is the same as
that of the actual argument. If a coscalar dummy argument has the
POINTER or ALLOCATABLE attribute, the actual argument must be a
coscalar with the same attribute.

8.3 Coscalar actual arguments matching a non-coscalar dummy argument
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the dummy argument is not a coscalar, the actual argument may
nevertheless be a coscalar, in which case copy-in/out will typically
be required. In this case the same additional synchronization rule
applies for modifiable arguments as for the corresponding case of a
coindexed actual argument. The actual argument must specify the square
brackets.

8.4 Generic interfaces
~~~~~~~~~~~~~~~~~~~~~~

Similar to coarrays, no generic disambiguation is possible with
respect to coscalar arguments.

9. Application to locks, teams and atomic procedures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the present standard, lock variables are required to be coarrays,
which may lead to misinterpretations (typically, only a subset of the
num_images() lock variables of a scalar coarray are actually
required). Using coscalars, the use of locks as well as the team
abstraction could be handled more elegantly:

   type(lock_type) :: my_lock[]
   type(image_team) :: my_team[]

Furthermore, this also facilitates using locks as components of data
structures:

   type :: container
      type(lock_type) :: lk
      type(data) :: protected_stuff
   end type

in which case any entity x of type container must be a coscalar, so
that x[]%lk is also a coscalar. Similarly, atomic subroutines should
be extended to allow scalar coscalars as arguments.
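As a sketch of the last point, an atomic subroutine could accept a
coscalar where the present standard requires a scalar coarray (this is
proposed syntax, not standard Fortran; the extension of ATOMIC_DEFINE
to coscalar arguments is hypothetical):

   integer(atomic_int_kind) :: flag[]   ! a coscalar, hosted on one image
   ...
   call atomic_define(flag[], 1)        ! any image may define it atomically

Only one flag exists program-wide, rather than the num_images() flags
implied by a scalar coarray, which matches the typical usage.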
Requiring teams to be coscalars instead of regular local entities
should provide advantages to both implementors and programmers:
 * for good scalability, teams can internally make use of coscalar
   pointer components, especially in the case of large image counts
 * handling teams is much more transparent and intuitive if they are
   coscalars; the usage pattern (write once, use often) fits
   perfectly, and if cross-team communication is to be supported, say

      with team(t1)
         a[i] = b[j]@t2
      end with team

   where t1 and t2 are teams not sharing any image, the shared
   semantics allows one to access team information across team
   boundaries, something not provided by the present draft.

10. An example: binary tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the following type definition

   type :: tree
      type(lock_type) :: lk
      type(content) :: entry
      ! entities of type content have < and possibly assignment overloaded
      logical :: defined = .false.
      type(tree), pointer :: left[] => null()
      type(tree), pointer :: right[] => null()
   end type

concurrent population of such an entity might be performed using the
following subprogram, which must be called with the same "this"
argument on each image:

   recursive subroutine insert(this, stuff)
      type(tree), intent(inout) :: this[]   ! must be a coscalar
                                            ! since we hand coscalars in
                                            ! and so that this[]%lk is
      type(content), intent(in) :: stuff
      lock(this%lk)
      if (this[]%defined) then
         unlock(this%lk)
         if (this[]%entry > stuff) then
            call insert(this%left, stuff)
         else
            call insert(this%right, stuff)
         end if
      else
         ! stuff goes to an entry possibly hosted by another image ...
         this[]%entry = stuff
         this[]%defined = .true.
         ! ... but I get to host the siblings ...
         allocate(this%left, image=this_image())
         allocate(this%right, image=this_image())
         ! caching of this[]%left and this[]%right to the new target is
         ! probably a good idea
         unlock(this%lk)
      end if
   end subroutine insert

After populating the data structure, the workload can be processed via

   recursive subroutine traverse(this, p)
      type(tree), intent(inout) :: this[]
      type(params), intent(in) :: p
      ! uses a subroutine operation() with non-coscalar dummy
      ! arguments to modify entries
      if (this[]%defined) then
         if (image_index(this) == this_image()) &
            call operation(this%entry, p)
         call traverse(this%left, p)
         call traverse(this%right, p)
      end if
   end subroutine

Note that if the calls to traverse() occur in segments ordered with
respect to the ones calling insert(), no race conditions occur. Since
each image only performs computation on the part of the tree hosted by
it, traverse() should scale well if operation() is sufficiently
expensive compared to the coscalar pointer lookup. For complete
processing, traverse() must be called by all images that previously
called insert(); it is not required that insert() be executed by all
images.

Acknowledgement:
~~~~~~~~~~~~~~~~

Apart from the conceptual derivation from UPC, the basic ideas
presented here are a subset of those in John Mellor-Crummey's papers
on his "CAF 2.0" vision; some modifications were made to improve the
integration with the language, as well as to enable the programmer to
perform optimization through locality control.

Comment from Jim Xia:
I don't like this name. It is very confusing, as people might think it
refers to a coarray that is scalar. So how about a new attribute,
SINGLE or SHARED? I know SHARED is going to be confusing as well to
people who are familiar with UPC.

Reply from Reinhold:
In choosing this, I started from the assumption that co- always refers
to something "shared" or "sharable". Since coarrays have a corank, it
seemed quite natural to call a corank-zero entity a coscalar.

-------------------------------------------------------------------------
Proposal 5.
Allow asynchronous execution of the collectives

Suggested by: Reinhold Bader

Allow asynchronous execution of the collectives. This would be
redundant if Proposal 1 is adopted.

-------------------------------------------------------------------------
Proposal 6. Reconsider the handling of coarrays in a team

Suggested by: Reinhold Bader

Reconsider the handling of coarrays in a team. Desirable features are
allocation within a team and a construct that establishes an execution
context for a team.

Issue (A): multiple objects with the same name on overlapping teams
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A problem solved neither by John Mellor-Crummey's team concept nor by
Bob Numrich's team%allocate function is that uniqueness of objects is
violated. That is, for an entity

   real, allocatable :: a[:]

and the situation where teams b and c overlap, the code

   with team b
      allocate(a[*])
   end with team
   with team c
      allocate(a[*])
   end with team

creates two distinct objects which exist simultaneously on the
overlapping set of images, notwithstanding that they have the same
identifier. While this could be tolerated, given the above syntax for
team execution or a notation for specifying team arguments on
coarrays, I think it is potentially rather confusing for the
programmer; it also seems to some extent to break the concept of a
local identifier, and as a consequence may be difficult to integrate
into the language.

Issue (B): reindexing in team execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From the point of view of the composability of existing coarray
software, it would be very desirable if subprograms executed by a team
(such as via a WITH TEAM block suggested by Mellor-Crummey) performed
renumbering of coindices as well as of image-set arguments to image
control statements and coarray-related intrinsics.
However, coarrays of corank 2 or higher, or coarrays with a lower
cobound unequal to 1, introduce a reindexing of their own, and it
appears that this cannot be brought into harmony with the team-related
reindexing. Also, even if it were possible to enable this
functionality (e.g. by introducing suitable restrictions on the kinds
of coarrays available for teamed execution), this would preclude
access to coarray data on an image which is not a member of the
presently executing team.

Issue (C): remove teams altogether
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even if it is decided that collective functions are to be included in
the TR, it is not desirable to remove teams entirely, since reductions
on subsets of images are a very useful feature, and the team concept
would support an optimized implementation of this. Similarly, because
the most efficient parallel I/O profile of large-scale parallel
programs makes use of image subsets typically an order of magnitude
smaller than the complete set of images, the availability of teams in
the context of I/O is considered important. For this reason, even if
it is decided that extending the handling of coarrays to teams is not
possible within the scope of the TR, teams should still be included to
cover the above functionality, perhaps with minor changes so as not to
stand in the way of adding coarray-related extensions later on.

An attempt to (partially) resolve issues (A) and (B)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The concept of an execution context is introduced, which serves to
localize - as far as possible - coarray functionality to the executing
team.
The intent is to bind all coarray entities to their execution
contexts; the need to be able to declare context-local entities
implies that there will be two ways available to the programmer for
creating non-default execution contexts:
 * a team-parameterized BLOCK construct
 * team-parameterized subprograms
For an execution context, the requirement is that all images of the
specified team must execute it; any image which is not a member of the
team and attempts to execute the context anyway simply transfers
control to the first statement after the end of the context.

As an example, consider the following code:

   program coupled
      type(image_team) :: ocean_team[*], atmosphere_team[*]
      : ! define integer value split
      if (this_image() < split) then
         call form_team(ocean_team[1])
      else
         call form_team(atmosphere_team[1])
      end if
      block with team ocean_team[1]
         ! only images from ocean_team execute this
         type(ocean_data), allocatable :: my_ocean(:, :)[:]
         allocate(my_ocean(n1, n2)[*])
         : ! work on ocean
      end block
      block with team atmosphere_team[1]
         ! only images from atmosphere_team execute this
         type(atm_data), allocatable :: my_atm(:, :)[:]
         allocate(my_atm(n3, n4)[*])
         : ! work on atmosphere
         ... = num_images()   ! size of atmosphere_team
         sync all             ! synchronizes across atmosphere_team only
         :
      end block
   end program

Inside the BLOCK construct executed by atmosphere_team, the coarray
my_atm exists on images split+1, ..., num_images() of the enclosing
execution context. However, execution of coarray intrinsics or image
control statements inside the block (in the form defined by F2008)
always refers to the local execution context. Access to the data in
the coarray my_ocean is not possible from the execution context
defined by the atmosphere_team block. At the end of the BLOCK
construct, the usual rules for deallocation of allocatable entities
etc. apply.
For a team-parameterized subprogram, the team will need to be part of
the subprogram's interface, since presumably the compiler will need to
propagate information contained in the team to any coarray-related
functionality inside the subprogram body. Subprogram calls or block
execution without a team parameter are performed using the caller's
execution context.

The next important question is: how are coarrays defined in the
enclosing execution context treated inside the local one? The (not
entirely satisfactory) answer given here is:
 * local definitions and references of such entities are trivially
   resolved (such entities are defined on a superset of the team image
   set, so there can be no problem);
 * coindexed definitions and references use the addressing of the
   execution context to which the entity is bound. In particular, this
   allows a team to read data which have been generated by another
   concurrently running team.
As a consequence it will sometimes be necessary to
 * identify an entity's execution context. This could, e.g., be done
   via an intrinsic function which returns a pointer to the team in
   which an entity has been created:

      type(image_team), pointer :: p[], q[]
      p[] => team_of(x)

 * be able to synchronize images from different teams against each
   other. This could be done by adding optional TEAM arguments to the
   image control statements. On team p,

      if (this_image() == 1) sync images (*, TEAM=q)

   would, for example, synchronize local image 1 of team p with all
   images of team q, without pairwise synchronization of the images in
   team q, and

      sync all ( TEAM=p, TEAM=q )

   would perform pairwise synchronization of all images in teams p
   and q.

One reason the answer is not entirely satisfactory is that it is the
programmer's responsibility to resolve the mismatch between the
coindexing of entities defined in the local and enclosing contexts,
respectively.
Supporting intrinsic functions should of course be provided, for
example

   global_image_index = TEAM_IMAGES(team_of(x), local_image_index)

and existing intrinsics may need to be extended with an additional
TEAM argument. (Note that there is a wart with IMAGE_INDEX(): the
version with a data argument will probably need to be split off into a
separate intrinsic.) The introduction of additional restrictions will
be required, for example that coarray entities from the enclosing
execution context with the ALLOCATABLE attribute may not be allocated
or deallocated inside the newly created context, etc.

A more radical suggestion, which does provide a satisfactory solution
from the ease-of-use point of view and reduces the implementation
effort at a minor (?) loss in functionality, is to prohibit coindexed
accesses to entities from enclosing execution contexts entirely. Local
accesses would still be allowed, enabling updates to the local portion
of such a coarray for later processing by other images. If still
desired, the additional functionality could be introduced post-TR.

-------------------------------------------------------------------------
Proposal 7. Add coarray pointers

Suggested by: Jim Xia

Add coarray pointers, requiring that the target of the pointer be a
local coarray. The primary motivation for this item is to allow
coarrays to be used in a function result. One example is to allow a
derived type with allocatable coarray components to be used as a
target to be associated with a pointer. It seems allowing the POINTER
attribute on coarrays is a reasonable solution.

I consider this proposal to comprise two separate parts. The first
part is to allow a derived type with allocatable coarray components to
be used as a target to be associated with a pointer.
The following is the original example from when I began to think about
allowing pointer coarrays. From a user's point of view, I would like
to allow the following practice:

   TYPE global_field
      REAL, allocatable :: f(:)[:]
   END TYPE

   TYPE my_field_type
      type(global_field), pointer :: global => null()
      REAL, allocatable :: local(:)
      ... ! type-bound operations
   END TYPE

where my_field_type stores a local copy of the global field that can
be updated frequently (e.g. with intermediate computational results).
The global field (as a coarray) is only updated whenever there is a
need. The type-bound operations can be functions returning objects of
this type as long as there is no update of the global field (i.e. no
violation of the segment ordering rules). Note this can also be used
as a strategy to re-mesh the global field when required; the remeshing
is encapsulated by my_field_type to hide the information from users
(e.g. when to update the global field). This declaration, however, is
currently not allowed.

The second part is the coarray pointers themselves. I would like to
suggest the following syntax:

   REAL, POINTER :: X(:)[:]

X can be allocated, or be associated with another coarray target.
Allocating X is the same as allocating an allocatable coarray:

   ALLOCATE (X(M)[*])

ALLOCATE and DEALLOCATE of X are considered collective operations, and
the same synchronizations as for allocatable coarrays apply here. X
can also be associated with a coarray target, as in

   X => Y

where Y is required to be a coarray target. In concept, each image has
its own X associated with a target of its own Y, so there should not
be any problems.

-------------------------------------------------------------------------
Proposal 8. Allow asymmetric allocatable and pointer objects

Suggested by: Bill Long

Allow asymmetric allocatable and pointer objects, declared with
deferred shape and explicit coshape, e.g.
   REAL, ALLOCATABLE :: A(:)[*]

This provides a mechanism for avoiding the artificial structure
workaround and gives users a way to create coarrays that are
restricted to a team. The downside is that you cannot call this thing
an "allocatable coarray" without having significant side effects
elsewhere in the standard. [This was a major reason the idea was
dropped previously.] Basically, the object is an orphaned component,
but there are no terms for that either.

-------------------------------------------------------------------------
Proposal 9. Suggestions for changes to the NOTIFY and QUERY statements

Suggested by: Reinhold Bader

Introduction
~~~~~~~~~~~~

In his critique of the coarray features in the Fortran 2008 draft
(J3/08-126), John Mellor-Crummey et al. specifically mention issues
with the NOTIFY and QUERY statements. This paper attempts to introduce
changes to the feature which remove these issues.

1. Properties of image control statements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both NOTIFY and QUERY are image control statements, but there are
circumstances under which execution of these statements should not
include the effect of a SYNC MEMORY statement:
 * execution of a NOTIFY should not have the effect of a SYNC MEMORY
   statement. As with LOCK and UNLOCK, a one-way ordering of the
   segments with respect to the target image executing the
   corresponding QUERY should be sufficient;
 * execution of a non-blocking QUERY statement for which the resulting
   READY value is .FALSE. should have no influence on segment
   ordering.

2. Notification Events
~~~~~~~~~~~~~~~~~~~~~~

The number of invocations N(M --> T) and Q(M <-- T) is not part of the
global state of the program, but always refers to a notification
event: N(M --> T, E), Q(M <-- T, E). Such an event is an entity of a
derived type EVENT_TYPE defined in the ISO_FORTRAN_ENV intrinsic
module, and such an entity - similar to a lock - must always be a
coarray (or, if coscalars make it into the TR, a coscalar).
The programmer must declare an event and use it as an argument to both
NOTIFY and QUERY, thereby ensuring that existing notifications do not
interfere with notifications and queries in library code, which would
use distinct events. Hence, the example from 10-166, NOTE 2.5, could
be modified as follows:

   SUBROUTINE PROCESS(...)
      ... ! declarations
      TYPE(EVENT_TYPE), SAVE :: PROCESS_EVENT[]
      IF (THIS_IMAGE()==1) THEN
         DO I=1,100
            ... ! Primary processing of column I
            NOTIFY(2, EVENT=PROCESS_EVENT) ! Done with column I
         END DO
         SYNC IMAGES(2)
      ELSE IF (THIS_IMAGE()==2) THEN
         DO I=1,100
            QUERY(1, EVENT=PROCESS_EVENT) ! Wait until image 1 is done with column I
            ... ! Secondary processing of column I
         END DO
         SYNC IMAGES(1)
      END IF
   END SUBROUTINE PROCESS

3. Excess notifications
~~~~~~~~~~~~~~~~~~~~~~~

The excess of notifications over queries for a given event and a given
pair of images should be limited to one. That is, while a program may
complete with an excess of notifications, it would be disallowed to
issue a new N(M-->T,E) on an event while the corresponding query is
still outstanding. Any situation where subsequent NOTIFY statements
(without interleaved queries) are required on the same image pair can
be handled by introducing multiple events, typically responsible for
protecting different coarray entities from unsynchronized access.

4. Using team arguments
~~~~~~~~~~~~~~~~~~~~~~~

For conciseness (and if teams make it into the TR), it should also be
allowed to use arguments of type IMAGE_TEAM instead of the image set
in NOTIFY and QUERY statements.

5. Some final remarks
~~~~~~~~~~~~~~~~~~~~~

The NOTIFY and QUERY statements provide a more general load-balancing
synchronization facility than the corresponding UPC construct. In UPC,
upc_notify and upc_wait are always collective; to avoid deadlocks it
is not allowed to start a new notification while a previous one is
still open.
In Fortran, apart from the possibility of performing NOTIFY and QUERY
for arbitrary subsets of images, it is also possible to construct a
split-phase barrier by having a subset of images execute NOTIFY and
QUERY with that subset as the image-set argument. By using different
event variables, new notifications may be started before previous ones
have completed, without incurring deadlocks. In particular, using a
split-phase barrier together with collective functions may provide
improved performance if the collectives do not enforce synchronization
at entry.
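Such a split-phase barrier might be sketched as follows (proposed
syntax, not standard Fortran; the image-set form of NOTIFY and QUERY
and the name BARRIER_EVENT are illustrative assumptions, with all
images 1..P of the subset executing the same code):

   TYPE(EVENT_TYPE), SAVE :: BARRIER_EVENT[]
   ...
   NOTIFY((/ (I, I=1,P) /), EVENT=BARRIER_EVENT) ! signal my arrival to the subset
   ... ! independent local work, overlapped with the barrier
   QUERY((/ (I, I=1,P) /), EVENT=BARRIER_EVENT)  ! wait for arrival of all partners

Work placed between the NOTIFY and the QUERY hides barrier latency, in
the same spirit as the upc_notify/upc_wait pair in UPC.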