ISO/IEC JTC1/SC22/WG5 N1856

A Critique of ISO/IEC JTC1/SC22/WG5 N1835 (Addition/Modification of CAF Features)
---------------------------------------------------------------------------------

Laksono Adhianto, John Mellor-Crummey, Guohua Jin, Karthik Murthy, Dung Nguyen, William N. Scherer III, Scott Warren, and Chaoran Yang

{laksono, johnmc, jin, Karthik.S.Murthy, dxnguyen, scherer, scott, chaoran}@rice.edu

In this article, we provide commentary on the feature additions and modifications to J3/08-131r1 that were discussed in September 2010. That document (N1835) is available online at ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1835.txt. Our commentary is based on our experience developing and using the Rice Coarray Fortran 2.0 (Rice CAF 2.0) programming language, runtime, and translator.

Proposal 1.
-----------

We generally support this proposal; however, we believe that a larger set of intrinsics would be useful. In particular, the full set of collectives supported by MPI seems worth considering. Although we did not implement the Rice CAF 2.0 collectives in this manner, having an optional result parameter seems reasonable to us.

Proposal 2.
-----------

We agree that "raw" atomic operations are useful for the development of high-performance synchronization and concurrency routines. We suggest that the committee consider the equivalent feature set from the Java programming language, which appears in the java.util.concurrent library, as it has been very successful in that community. Specifically, it supports two key features that are missing from this proposal:

(1) Atomic swap, also known as fetch-and-store, is necessary for the implementation of commercially important algorithms, including the acquire() routine for the widely used MCS queue-based lock. Although atomic swap can be simulated with a looped CAS construct, this is an imperfect approximation: the CAS loop can fail arbitrarily many times before succeeding (starvation), whereas a native atomic swap is guaranteed to complete within a bounded length of time.

(2) CAS on pointer values -- equivalent to the java.util.concurrent.AtomicReference class -- is necessary for the implementation of virtually all concurrent algorithms currently in use. In C, support for integers is sufficient because its more permissive cast operations allow the programmer to cast a pointer to an integer type; the equivalent functionality is not available in Fortran because of its stronger typing.

We note that the restriction of types to exclude variables of type real seems arbitrary; however, we have no opinion on whether reals should be explicitly included as possible targets of the atomic instructions.

Finally, we observe that some level of protection against the so-called ABA problem is desirable. The ABA problem occurs when a CAS is performed against a value that has changed, but has then changed back to its original value, between when it was first read and when the CAS takes effect. In this case it is usually wrong (algorithmically) for the CAS to succeed; this leads to subtle corruption and difficult-to-track-down race conditions. We additionally refer the committee to the C++ atomics standardization work by Hans Boehm and Lawrence Crowl [1].
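
To make the difference concrete, here is a minimal sketch of how a swap must be emulated with CAS, contrasted with a native swap. The atomic_cas and atomic_swap intrinsics named below are hypothetical (the names and argument orders are ours, chosen for illustration); only atomic_ref and atomic_int_kind are existing Fortran 2008 features.

      subroutine emulated_swap(atom, new, old)
        use, intrinsic :: iso_fortran_env, only: atomic_int_kind
        integer(atomic_int_kind), intent(inout) :: atom[*]
        integer(atomic_int_kind), intent(in)    :: new
        integer(atomic_int_kind), intent(out)   :: old
        logical :: success
        do
          call atomic_ref(old, atom)               ! read the current value
          ! hypothetical CAS: store new only if atom still equals old
          call atomic_cas(atom, old, new, success)
          if (success) return
          ! another image raced us between the read and the CAS; retry
        end do
      end subroutine emulated_swap

      ! With a native swap the retry loop, and hence the starvation
      ! window, disappears:
      !     call atomic_swap(atom, new, old)

The retry loop above is exactly the pattern an MCS-style acquire() would otherwise have to use, and it is the source of the unbounded delay that a first-class fetch-and-store avoids.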

Proposal 3.
-----------

3b) We generally concur that restrictions should be present only when absolutely necessary.

3c) In our view, there is already enough confusion in the world about the difference between global and local synchronization. They are very different things; combining them into a single sync statement will only serve to increase the confusion.

3d) We see no problem with allowing functions to have side effects. Rather than an IMPURE attribute that marks a function as potentially having side effects, however, we espouse a PURE attribute that is an explicit promise, made by the programmer, that a function is free of side effects.

3e) Fundamentally, we disagree with requiring MPI in addition to Fortran in order to have a complete programming model: [Coarray] Fortran should stand on its own. There is substantial utility in having a rich set of collectives, and compiler support for them can greatly ease the burden on the programmer (and reduce opportunities for error) when using them. For example, in the Rice CAF 2.0 implementation, we have built support in the compiler to automatically compute the sizes of data and to generate callback functions.

3f) Teams are needed for coupled codes and are very useful for linear algebra applications. Again, we disagree strongly with requiring MPI in addition to Fortran in order to have a complete programming model. This is particularly true when an all-coarray Fortran program could be aesthetically pleasing.

3g) We dislike notify and query because we strongly prefer first-class events. Instead of directly synchronizing with another processor, we find it a far better programming model to synchronize with an event that is logically connected to remote data. Further, events provide a safe synchronization space: if a library routine notifies one of its own events, that notification cannot be accidentally consumed by a waiting operation in user code, but with direct processor-to-processor synchronization the same cannot be said. Debugging synchronization errors of this form is slow, tedious, and painful.

3h) Rather than have an intrinsic isMyLock that is specific to locks, we propose extending imageof() from handling just copointers to also handling locks and events. However, we note that many implementations will wish to use a test-and-test-and-set lock, for which lock ownership information is not normally stored with the lock. Rather than isLocked(), we would suggest adding a trylock() function that attempts to acquire a lock if it is unlocked and fails otherwise. Programmers should not write their own spin loops. Locks can implement their own spins, including spin-then-yield code as appropriate. This gains efficiency since no traversal of data structures is necessary to find the memory location to spin on. On the subject of locks, we note that formal locksets allow multi-lock acquisition to occur in a canonical order; this provides a degree of safety against cyclic deadlock in multi-lock codes. A very simple canonical order would be the addresses of the lock variables.
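
The following sketch shows how such a trylock() might be used so that an image makes progress instead of spinning. The trylock() interface and the helper routine are illustrative only; lock_type is the existing Fortran 2008 type.

      use, intrinsic :: iso_fortran_env, only: lock_type
      type(lock_type) :: work_lock[*]
      integer :: owner
      logical :: got_it

      owner = 1                            ! image holding the shared resource
      do
        got_it = trylock(work_lock[owner]) ! acquire if free, fail otherwise
        if (got_it) exit
        call do_other_useful_work()        ! hypothetical helper: make progress
      end do
      ! ... critical section guarded by work_lock ...
      unlock(work_lock[owner])

Any spinning or spin-then-yield policy stays inside the lock implementation; the program only states what to do while the lock is unavailable.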

3i) While compatibility is useful, we reiterate our stance that Fortran 2008 should stand on its own. For example, the compiler can generate multithreaded or CUDA code from a do concurrent loop. Requiring CUDA + OpenMP + MPI + CAF is far less aesthetically appealing than an all-CAF solution.

Proposal 4.
-----------

This proposal is subsumed by our approach to copointers, the details of which appear in Appendix II. In particular, we observe that adding the cotarget attribute to a non-coarray variable makes it a coscalar by requiring that it be allocated in the shared memory space. We see no need for the relocate() statement nor for the image= qualifier to the allocate statement. Functionality equivalent to relocate() can be achieved by simply reallocating the scalar and copying data from the old location to the new one. Functionality equivalent to the image= qualifier can be achieved by placing a conditional around the allocation statement:

      if (mype .eq. 4) then
        allocate(foo)
      endif

We note that for caching purposes, it suffices to copy a coscalar to a local variable. In general, the heap is not symmetric; providing optimizations based on an assumption otherwise seems ill-advised.

Proposal 5.
-----------

We are in full agreement that asynchronous collective operations are useful and desirable. In fact, we have used them to good effect in developing Rice CAF 2.0 implementations of the High Performance Computing Challenge (HPCC) benchmarks [2]. Rice CAF 2.0 supports two variants of asynchrony for collectives. In the explicit model, an event variable is supplied as a parameter to the collective. Upon completion of the collective operation, the event is notified. This allows the programmer to determine when the collective operation has completed so that subsequent code, predicated on completion of the collective, may be executed.

      co_sum_async(some_coarray, some_event)  ! kick off a reduction
      ...                                     ! overlap computation with it
      event_wait(some_event)                  ! ensure it has completed

In contrast, in the implicit model, the programmer omits the event variable and instead calls an explicit "cofence" to be sure that all pending operations have completed:

      co_sum_async(some_coarray)  ! kick off an asynchronous reduction
      ...                         ! overlap computation with the reduction
      cofence                     ! ensure it has completed

For more details on the cofence, see Appendix I. In addition to collectives, we have found substantial benefit in supporting two other asynchronous functions:

(1) An asynchronous barrier offers the same functionality as a split-phase barrier. Triggering the barrier is equivalent to a notify, and waiting on the event (or blocking with a cofence) is equivalent to awaiting completion of the barrier.

(2) A predicated asynchronous copy allows data to be transferred to/from a remote image as soon as it is ready, and automatically notifies an event when the copy has completed. This is useful, for example, in a scenario where we have initialization to perform and need data from a partner:

      copy_async(my_buffer, remote_buffer[partner], pred_event, &
                 data_copied_event)
      ...  ! perform other initialization while waiting for the data
      event_wait(data_copied_event)  ! make sure we have the data
      ! proceed with computation

Here, we have overlapped the computation of our initialization with the communication of data into my_buffer from the partner's remote_buffer.

Proposal 6.
-----------

Issue A) We believe that this is a non-issue. The allocation of coarray 'a' on team c would overwrite the pointer to 'a' on overlapping members; there can be only one 'a' on any image. This would of course be a programming error that could be checked at runtime when trying to allocate an already-allocated pointer. In the Rice CAF 2.0 implementation, coarrays are registered after allocation; the name-duplication conflict would manifest at this stage if it had not previously been detected.

Issue B) As detailed in Tony Skjellum's rationale for MPI libraries [3], reindexing is crucial if support libraries are to be developed. We agree that having ranks > 1 poses several logistical problems from a language viewpoint. This is precisely why we oppose having more than one rank for codimensions. However, to provide the functionality of multiple dimensions, we support topologies. In particular, with a Cartesian topology, one can write code that appears to index multiple ranks. The indexing is reduced by the topology to a linearized one-dimensional index into the single physical rank for the coarray. This resolves the issues described here.
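
A minimal sketch of the idea follows; the way the topology is expressed here (plain integer arithmetic) is illustrative only, and is neither the Rice CAF 2.0 nor any proposed standard syntax, but it shows what the linearization does.

      integer :: a[*]                  ! single physical codimension
      integer :: nrows, ncols, row, col, linear_image

      nrows = 4
      ncols = num_images() / nrows     ! assume num_images() is a multiple of 4

      ! A Cartesian topology is a bijection between logical (row, col)
      ! coordinates and the linear image index 1..num_images().
      row = 2
      col = 3
      linear_image = (row - 1) * ncols + col

      a[linear_image] = 42             ! looks 2-D to the user, 1-D to the coarray

A library layered on such a topology can offer its users multi-dimensional co-indexing while every coarray underneath keeps a single codimension.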

Issue C) We believe that teams are very useful for many applications, including coupled codes and linear algebra applications, to name two. We urge the committee not to remove them from the Fortran 2008 specification.

Proposal 7.
-----------

This proposal is subsumed by our approach to copointers, the details of which appear in Appendix II.

Proposal 8.
-----------

We agree that this proposal has appeal. In fact, an early version of our CAF 2.0 implementation supported asymmetric coarrays. But when we tried it, it caused chaos with the reshaping of arrays. Suppose, for example, that we have 2D arrays of different sizes. Now suppose we pass column 3 to a local subroutine, which then tries to access that column on another image. We see no reasonable way to handle the case where that column does not exist on the remote image. Further, even if the remote image *does* have a third column in the coarray, what if the columns are of differing lengths? The subroutine has no good way to know the bounds of the column on the remote image. A semantic problem occurs when we attempt to access the entire column (via a ':' operator): does the colon refer to the local or the remote bounds? For all of these reasons, we dropped support for asymmetric coarrays from our CAF 2.0 compiler.

Proposal 9.
-----------

We note that when reading a standard, it is useful to have names that are logically associated appear near each other in the standard, including in the index and in a table of intrinsics. For this reason, we have adopted event_wait and event_notify in the Rice CAF 2.0 implementation.

9.1) As detailed in our memory model notes (see Appendix I), notify should be a "release" operation. Coarray operations that appear after a notify may execute before the notify, but no coarray operation before a notify should execute after it. This is needed to make events reasonable: if a programmer writes to a remote coarray and then performs a notify to signal that the write has completed, the write had better not be delayed until after the notification! Similarly, query should be an "acquire" operation (with the antisymmetric dependences). In general, the semantics of notify should be non-blocking. Notification should occur after the communication completes, but there is no need to block the caller until that time. Blocking would just make it harder to overlap communication latency with computation, which is crucial for extracting maximum performance in HPC environments.

9.2) It seems strange to separate the image number and the event name when they could be combined into a single parameter. For example, the second statement below seems far more intuitive and in keeping with existing coarray syntax:

      notify(3, some_event(i))     ! As proposed
      notify(some_event(i)[3])     ! Implemented in Rice CAF 2.0

9.3) We disagree with restricting the number of outstanding notifies to one. For example, a bounded-buffer implementation could take advantage of -- and would require -- higher limits.

9.4) We note that image numbers should be relative to a team. For example, in the following call, j is relative to the team some_team, not an absolute image number:

      notify(some_event(i)[j@some_team])

9.5) Please do not conflate notify and query with (asynchronous) barriers. Point-to-point and collective operations should be kept separate.

On the subject of events, and similar to the locksets we proposed earlier in this document, we propose eventsets. As implemented in Rice CAF 2.0, these collections of events offer programmers the following convenient functionality:

      notifyall:   perform a notify on each member event
      waitall:     wait for each member event to be notified
      waitany:     wait for one member event to be notified, similar to the
                   socket library's select() mechanism
      waitanyfair: wait for notification on one of the member events that
                   has received the fewest notifications

Since it may not be obvious, the intent behind waitanyfair is that by calling it in a loop exactly N times, where N is the cardinality of the eventset, a notification of each component event is guaranteed to have been consumed exactly once by the time the loop terminates.

References
----------

[1] Hans-J. Boehm and Lawrence Crowl. C++ Atomic Types and Operations. ISO/IEC JTC1/SC22/WG21 N2427 = 07-0297, 2007-10-03. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html.

[2] HPC Challenge benchmark. http://icl.cs.utk.edu/hpcc.

[3] A. Skjellum, N. E. Doss, and P. V. Bangalore. Writing libraries in MPI. In A. Skjellum and D. S. Reese, editors, Proceedings of the Scalable Parallel Libraries Conference, pages 166–173. IEEE Computer Society Press, October 1993.

Appendix I: Commentary on the Fortran 2008 Memory Model
-------------------------------------------------------

In this section, we present our views on the memory model described in the draft Fortran 2008 standard. Although the memory model is not formally described in the F2008 draft, our views are based on information mined from it, especially Section 8.5 (Image Execution Control).

Comment #1: The draft standard does not define the consistency requirements within a segment. We recommend processor consistency for coarray reads/writes within a segment. The absence of any form of consistency within a segment allows aggressive compiler/hardware reorderings; this forces the programmer to introduce numerous memory fences for correctness, which in turn makes the code harder to optimize.

Comment #2: The current memory model makes for a difficult programming model. It takes a "performance-first" approach, allowing aggressive compiler/hardware optimizations that reorder operations within a segment, or between segments that are not ordered via image control constructs. As a result, programmers have to use sync_memory to avoid subtle race conditions in many places, especially when asynchronous operations are employed. We believe that the average programmer should not have to learn the intricacies of the memory model (such as needing to use sync_memory) in order to write correct code.
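
A small example of the burden follows. The notify/event syntax is the Rice CAF 2.0 form used earlier in this document, and the variable names are illustrative; the point is the fence the current model forces the programmer to write.

      x(1:n)[p] = buf(1:n)       ! put data to image p
      sync memory                ! required today: keep the put from being
                                 ! reordered past the signal below
      notify(ready_event[p])     ! tell image p the data is there

With the release semantics for notify that we advocate in our comments on Proposal 9, the explicit sync memory above would be unnecessary.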

Comment #3: The current memory model lacks predicated fences. We believe that predicated fences, such as our cofence, are a necessary addition. The current memory fence (sync_memory) is not sufficiently flexible to provide the performance tuning that advanced programmers need: as currently described, sync_memory acts as a barrier for all memory and coarray operations. However, advanced programmers need constructs that separately capture the local and global completion of coarray operations, especially asynchronous ones. The cofence construct allows programmers to control the local completion of put, get, and implicitly synchronized asynchronous operations. The cofence API is as follows:

      cofence({DOWNWARD=PUT/GET/PUT_GET}, {UPWARD=PUT/GET/PUT_GET})

Cofence takes two optional arguments. The first specifies which categories of implicitly synchronized asynchronous operations (puts, gets, or both) are allowed to move downward across the cofence; the second specifies which categories are allowed to move upward. Depending on the argument values passed, the cofence allows puts, gets, or both to pass across it in the specified direction.

Let us consider a collective asynchronous broadcast operation to understand the use of cofences in tuning performance.

      ! process p is performing a broadcast
      broadcast_async(buffer, p)
      cofence(DOWNWARD=GET, UPWARD=PUT_GET)
      ! after the cofence, buffer can be safely overwritten
      buffer = ...
      ! wait for global completion of the broadcast
      cofence

In the code sample above, process p is performing an asynchronous broadcast. Once p sends the broadcast data to its children (i.e., the broadcast is locally complete in p), p does not need to participate in the remainder of the broadcast. Process p can thus overlap useful work, such as preparing the buffer for the next iteration, with waiting for the broadcast to complete. While capturing this local completion, it is efficient to allow other "get" operations to be performed later (to pass downward across the cofence) and "put"/"get" operations to be performed earlier (to pass upward); a full memory barrier would not allow these efficiencies. The broadcast is globally complete when all participating processes have obtained the broadcast data. Global completion matters to process p if activities after the broadcast in p depend, directly or transitively, on the assumption that the other processes have received the broadcast data.

Comment #4: It is not clearly stated (but it is implied) that functions should not have side effects. This should be clarified in the standard.

Appendix II: Copointers in Rice CAF 2.0
---------------------------------------

CAF 2.0 adds global pointers to the Fortran language in support of irregular data decompositions, distributed linked data structures, and parallel model coupling. The definition and use of these new "copointers" is as similar as possible to that of ordinary Fortran pointers: they are declared with new attributes analogous to 'pointer' and 'target', manipulated with the existing '=>' pointer assignment statement, and inspected with the existing pointer intrinsics. Accessing data via copointers is as similar as possible to existing coarray accesses, with implicit access to the local image and explicit access to remote images using a square-bracket notation. CAF 2.0's copointers may point to values of any type, including coarrays; we believe that copointers to coarrays will be especially valuable for parallel model coupling in systems like the Community Earth System Model. Copointers can be implemented easily and efficiently in existing CAF compilers; we have already begun adding them to our prototype CAF 2.0 compiler.

The rest of this note explains the copointer concept in more detail, then describes how copointers are declared, created, copied, dereferenced, and inspected. It closes by mentioning a few nonobvious semantic details and sketching an implementation strategy. The approach here is tutorial rather than formal, and the terminology is for the most part programmer-oriented rather than compatible with the Fortran standard documents. For instance, we usually say "variable" rather than the standard's "entity" and "points to" rather than "is associated with". But not always!

COPOINTERS AND COTARGETS

Copointers are typed "global pointers" which can point to storage on any processor ("image") in a parallel computer. Each copointer points to a specific typed block of storage (a Fortran "entity") allocated on a specific image. Despite the "co" in their name, copointers are not distributed across images like coarrays; each copointer is a small scalar value residing on a single image. Apart from their global reach, the semantics of copointers are nearly identical to the semantics of ordinary Fortran pointers: copointer variables and copointer components of derived types may be declared, set to point to other entities, copied, dereferenced, sectioned via subscripting to yield copointers to subentities, and examined via the existing Fortran pointer intrinsics. It may be helpful to think of a copointer as a pair (i, p), where 'i' is an image number and 'p' is an ordinary Fortran pointer valid on image 'i', although the implementation may be different.

Cotargets are entities which may become the destination of a copointer. Such entities must be declared with the 'cotarget' attribute, just as potential destinations of ordinary pointers must be declared with the 'target' attribute. If a CAF 2.0 implementation relies on special "shared memory" regions for efficient communication between images, then it will allocate entities with the 'cotarget' attribute in such a region. Cotarget entities are in all other respects ordinary entities and may be used locally without restriction.

Copointer values may be freely copied, even from one image to another, and each new copy points to the same specific storage block on the same specific image as the original copointer. Creating and copying copointers are cheap, purely local operations. So is dereferencing a copointer that happens to point to the image doing the dereferencing. Dereferencing a copointer which points to a different image requires the same sort of communication as a corresponding off-image coarray reference.
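
As an illustration of why this matters for distributed linked data structures (one of the motivations above), the node of a distributed linked list can carry a copointer to a successor that may live on any image. The type below is an illustrative sketch using the declaration syntax described in the next section; the names are ours.

      type :: node
        integer               :: payload
        type(node), copointer :: next    ! may point to a node on any image
      end type node

      type(node), cotarget :: head       ! eligible to be the target of a copointer

Traversing such a list simply follows 'next' copointers from image to image, which is awkward at best to express with symmetric coarrays alone.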

DECLARING COPOINTERS AND COTARGETS

Copointer and cotarget entities are declared with the usual Fortran declaration syntax augmented with the new 'copointer' and 'cotarget' attributes. For instance, to declare an integer array and a copointer which can point to it, we write

      integer, dimension(10), cotarget :: a1
      integer, dimension(:), copointer :: p1

This makes 'a1' an array of 10 integers allocated in shared memory and 'p1' a copointer variable of compatible type. Copointers may point to entities of any type, subject to the limitations of Fortran's attribute syntax as explained below. In particular, CAF 2.0 allows copointers to coarrays, providing an expressive and efficient mechanism for model coupling in large parallel codes.

The 'copointer' and 'cotarget' attributes may be combined with other Fortran attributes just as 'pointer' and 'target' may be. For instance,

      type(t), dimension(:,:), save, contiguous, copointer :: p2

declares a copointer entity 'p2' which points to two-dimensional arrays of elements of derived type 't', which retains its association across subprogram invocations, and which can only be associated with contiguous cotarget arrays.

Declaring cotargets needs no further explanation. To describe how copointer types are declared, we must first consider a key syntactic feature of Fortran's existing type declarations: namely, that the textual order in which an entity's attributes are given is insignificant. This feature both resolves potential ambiguities and limits the set of data types which can be expressed. For instance, both of the following declarations specify the type "pointer to array of integer":

      integer, pointer, dimension(:) :: p3
      integer, dimension(:), pointer :: p4

Since the order of appearance of 'pointer' and 'dimension' does not matter, the ambiguity in interpretation is resolved by a rule we can write as "pointer < dimension"; that is, 'pointer' has lower syntactic priority than 'dimension' and so is applied later during type formation, giving "pointer(dimension1(integer))" as the specified type. Because of this rule, there is no way to express the type "array of pointer to integer" in Fortran. However, the missing type can be simulated by wrapping a pointer in a derived type:

      type :: t
        integer, pointer :: p
      end type
      type(t), dimension(:), allocatable :: a2

      ! initialize a2 ...
      a2(1)%p = 0

We can now describe the precise syntactic interpretation of 'copointer' in CAF 2.0 by the following rules:

      pointer < copointer < codimension < dimension

These precedence relations are consistent with the existing syntax of Fortran 2008 and give an unambiguous interpretation of every possible combination of these four attributes in a type declaration. For instance, both of the following declarations specify a copointer to a coarray of corank 1, rank 2, and element type integer:

      integer, dimension(:,:), codimension[:], copointer :: p5
      integer, copointer :: p6(:,:)[*]

In each declaration, the three attributes 'copointer', 'codimension', and 'dimension' occur and are interpreted in that order to give the type "copointer(codimension1(dimension1(integer)))". Like Fortran 2008's, CAF 2.0's attribute interpretation rules resolve ambiguity at the cost of limiting the set of types which can be directly expressed. For instance, "array of copointer" cannot be expressed directly but can be simulated with derived types just as shown above for "array of pointer".
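
Spelled out, the simulation of "array of copointer" looks just like the pointer case. The wrapper type below is an illustrative sketch (the names are ours), following the declarations above.

      type :: cp_box
        integer, dimension(:), copointer :: p   ! one boxed copointer
      end type cp_box

      type(cp_box), dimension(:), allocatable :: table

      allocate(table(num_images()))
      ! table(i)%p can now be associated with data on image i, giving the
      ! effect of an array of copointers indexed by image number.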

CREATING AND COPYING COPOINTERS

Copointers are created and copied via Fortran's existing 'allocate' and pointer assignment statements, in the same way as ordinary pointers. There are four cases to consider.

(1) A copointer is created when an 'allocate' statement is executed with a copointer variable as its argument. The allocated storage comes from the current image's shared memory region so that it can be accessed from any other image. A copointer to that storage is created and stored in the argument variable.

(2) A copointer is created when a pointer assignment statement's right-hand side (RHS) is a plain data reference; a new copointer to the RHS is assigned to the variable on the left-hand side (LHS). (In Fortran terminology, the LHS entity "becomes copointer associated with" the RHS data reference.) The RHS must have the 'cotarget' attribute. The RHS may be either a reference to local data on this image or a reference to remote data on another image; in either case, a copointer is created which points to the RHS data. Of course, for an RHS to reference remote data it must be a coarray reference or a copointer-dereference expression (next section). For instance, the following two statements both create copointers, one pointing to a local array and one pointing to an array on another image:

      integer, dimension(:), copointer :: p7, p8   ! copointers to array of integer
      integer, dimension(10), cotarget :: a3[*]    ! coarray of array of integer

      p7 => a3      ! copointer to a3's local array
      p8 => a3[9]   ! copointer to a3 on image 9

(3) When a pointer assignment statement's RHS is an ordinary (i.e., local) pointer, the local pointer cannot be copied as-is into the LHS because its type is not correct. Instead, the pointer is converted into a copointer and assigned to the LHS; this is a form of copointer creation. For instance:

      integer, dimension(:), pointer :: r   ! pointer to array of integer

      r => a3       ! creates a pointer to a3's local array
      p7 => r       ! converts the local pointer to a copointer

(4) A copointer is copied when a pointer assignment statement's RHS is already a copointer. Given the previous declarations, the following statement copies an existing copointer:

      p7 => p8

DEREFERENCING COPOINTERS

Copointers may be "dereferenced" to get a data reference that can be used in either RHS or LHS contexts. In general the data reference is remote, so loading from it and storing into it require communication with another image. For this reason, CAF 2.0 requires copointers to be explicitly dereferenced via a new "co-dereference operator" ([ ]) to indicate this communication cost in the source code. This is in contrast to Fortran's implicit dereferencing of ordinary pointers. For instance, the previously introduced variable 'p7' is a copointer to array of integer, so 'p7[ ]' is just an array of integer, and the following assignments copy integers and integer arrays between this image and some other image:

      integer :: k
      integer, dimension(10) :: a4

      k = p7[ ](1)
      a4 = p7[ ]
      p7[ ](1) = a4(1)
      p7[ ] = a4

For additional expressiveness, CAF 2.0 allows a copointer to be dereferenced implicitly when it is known that the copointer points to local data. This indicates in the source code that the dereference operation requires no communication. The result of an implicit dereference is undefined if the copointer points to another image. For instance, if the value of 'p7' is a copointer to this image, we can write:

      k = p7(1)
      a4 = p7
      p7(1) = a4(1)
      p7 = a4

COPOINTER INTRINSIC FUNCTIONS

CAF 2.0 extends the pointer-related intrinsic procedures of Fortran 2008 to work with copointers as well. For instance, 'associated(p7)' returns a logical indicating whether 'p7' is associated with a target, and 'p7 => null()' sets 'p7' to disassociated status. In addition, CAF 2.0 provides a new intrinsic function 'imageof' which returns the image number to which an associated copointer points; the result is undefined if the copointer is disassociated.
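
The intrinsics combine naturally with the dereference rules above. The fragment below (reusing 'p7' and 'a4' from the earlier examples) is an illustrative sketch, not prescribed usage:

      if (associated(p7)) then
        if (imageof(p7) == this_image()) then
          a4 = p7       ! local target: implicit dereference, no communication
        else
          a4 = p7[ ]    ! remote target: explicit co-dereference, communicates
        end if
      end if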

SEMANTIC DETAILS

Here are a few related details of CAF 2.0 semantics.

(1) A copointer value may be implicitly converted into an ordinary pointer when it is known that the copointer points to local data. The result is undefined if it points to another image. For instance, if the value of 'p7' is a copointer to this image, we can write:

      r => p7

(2) Fortran 2008 forbids associating an ordinary pointer with a remote data reference (a coindexed object, i.e., all or part of a coarray). Similarly, CAF 2.0 forbids associating an ordinary pointer with the result of dereferencing a copointer. Thus the following statement is incorrect:

      r => p7[ ]   ! not allowed, even though the RHS is type-compatible
                   ! with 'r' ("array of integer")

(3) As mentioned above, CAF 2.0 allows all possible combinations of the four type-determining attributes. In addition to our new attributes, this extends Fortran 2008's use of existing attributes by allowing "pointer to coarray". CAF 2.0 also eliminates Fortran 2008's restrictions on nesting coarrays and on embedding coarrays within arrays.

IMPLEMENTATION

CAF 2.0's copointers can be implemented easily and efficiently, so that dereferencing a copointer is no more expensive than a corresponding coarray reference, and typically cheaper. To add copointers to a compiler which already implements coarrays, one has only to factor the code generation for a coarray reference into two parts: a generalized address calculation to determine which bytes are needed from which image, followed by a communication operation to obtain those bytes across the interconnect. Then the code for a copointer dereference is just the communication code, because a copointer's representation essentially caches the result of an address calculation.

Specifically, our prototype CAF 2.0 compiler represents a copointer value as a pair (i, p), where 'i' is an image number and 'p' is an ordinary Fortran pointer valid on image 'i'. Our prototype dereferences a copointer to remote storage by sending its pointer 'p' to the image 'i' that created it, dereferencing the pointer normally on 'i', and receiving the fetched bytes in reply. This is about the same communication cost as a corresponding coarray reference. On a machine whose interconnect hardware supports one-sided communication, the CAF 2.0 runtime could instead decode 'p' and use the corresponding addresses, strides, and lengths to initiate low-level hardware communication directly.

Our prototype's representation does make an assumption about the underlying Fortran compiler's storage allocator: the allocator must tolerate our copying and storing of pointers beyond its reach. For instance, the allocator must not do reference counting, garbage collection, or storage compaction by moving blocks and updating pointers, because the allocator cannot see our copies of pointers on other images. The Fortran language does not require any of this, and in fact all commonly used Fortran compilers satisfy our assumption. However, a simple change of representation would permit implementing CAF 2.0 on an allocator which does not satisfy the assumption: the pointer component 'p' is replaced by an opaque handle 'h' which can be looked up on image 'i' to yield the corresponding pointer. Instead of sending 'p' to the remote image, one would send 'h' at the same cost, and the rest of the implementation would be unchanged.
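
One possible concrete shape for these two representations is sketched below. This is an illustration of the idea only, not the layout our prototype actually uses; the type and component names are ours.

      use, intrinsic :: iso_c_binding, only: c_ptr

      ! Pointer-based representation: the (i, p) pair described above.
      type :: copointer_rep
        integer     :: image    ! the 'i': owning image number
        type(c_ptr) :: addr     ! the 'p': an address meaningful only on 'image'
      end type copointer_rep

      ! Handle-based variant for an allocator that may move storage:
      ! 'handle' is looked up in a per-image table to recover the address.
      type :: copointer_rep_handle
        integer :: image
        integer :: handle
      end type copointer_rep_handle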