ISO/IEC JTC1/SC22/WG5 N1856

A Critique of ISO/IEC JTC1/SC22/WG5 N1835 (Addition/Modification of CAF Features)
---------------------------------------------------------------------------------

Laksono Adhianto, John Mellor-Crummey, Guohua Jin, Karthik Murthy, Dung Nguyen, William N. Scherer III, Scott Warren, and Chaoran Yang

{laksono, johnmc, jin, Karthik.S.Murthy, dxnguyen, scherer, scott, chaoran}@rice.edu

In this article, we provide commentary on the feature additions and modifications to J3/08-131r1 that were discussed in September 2010. That document (N1835) is available online at ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1835.txt. Our commentary is based on our experience developing and using the Rice Coarray Fortran 2.0 (Rice CAF 2.0) programming language, runtime, and translator.

Proposal 1.
-----------

We generally support this proposal; however, we believe that a larger set of intrinsics would be useful. In particular, the full set of collectives supported by MPI seems worth considering. Although we did not implement the Rice CAF 2.0 collectives in this manner, having an optional result parameter seems reasonable to us.

Proposal 2.
-----------

We agree that "raw" atomic operations are useful for the development of high-performance synchronization and concurrency routines. We suggest that the committee consider the equivalent feature set from the Java programming language, which appears in the java.util.concurrent library, as it has been very successful in that community. Specifically, it supports two key features that are missing from this proposal:

(1) Atomic swap, also known as fetch-and-store, is necessary for the implementation of commercially important algorithms, including the acquire() routine for the widely used MCS queue-based lock. Although atomic swap can be simulated with a looped CAS construct, this is an imperfect approximation: the CAS loop can fail arbitrarily many times before succeeding (starvation), whereas a native atomic swap is guaranteed to complete within a bounded length of time.

(2) CAS on pointer values -- equivalent to the java.util.concurrent.AtomicReference class -- is necessary for the implementation of virtually all concurrent algorithms currently in use. In C, support for integers is sufficient because its more permissive cast operations allow the programmer to cast a pointer to an integer type; the equivalent functionality is not available in Fortran because of its stronger typing.

We note that the restriction of types to exclude variables of type real seems arbitrary; however, we have no opinion on whether reals should be explicitly included as possible targets of the atomic instructions.

Finally, we observe that some level of protection against the so-called ABA problem is desirable. The ABA problem occurs when a CAS is performed against a value that has changed, but has then changed back to its original value, between when it was first read and when the CAS takes effect. In this case it is usually wrong (algorithmically) for the CAS to succeed; this leads to subtle corruption and difficult-to-track-down race conditions. We additionally refer the committee to the C++ atomics standardization work by Hans Boehm and Lawrence Crowl [1].
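
To make the difference concrete, here is a minimal sketch of how a swap must be emulated with CAS, contrasted with a native swap. The atomic_cas and atomic_swap intrinsics named below are hypothetical (the names and argument orders are ours, chosen for illustration); only atomic_ref and atomic_int_kind are existing Fortran 2008 features.

      subroutine emulated_swap(atom, new, old)
        use, intrinsic :: iso_fortran_env, only: atomic_int_kind
        integer(atomic_int_kind), intent(inout) :: atom[*]
        integer(atomic_int_kind), intent(in)    :: new
        integer(atomic_int_kind), intent(out)   :: old
        logical :: success
        do
          call atomic_ref(old, atom)               ! read the current value
          ! hypothetical CAS: store new only if atom still equals old
          call atomic_cas(atom, old, new, success)
          if (success) return
          ! another image raced us between the read and the CAS; retry
        end do
      end subroutine emulated_swap

      ! With a native swap the retry loop, and hence the starvation
      ! window, disappears:
      !     call atomic_swap(atom, new, old)

The retry loop above is exactly the pattern an MCS-style acquire() would otherwise have to use, and it is the source of the unbounded delay that a first-class fetch-and-store avoids.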

Proposal 3.
-----------

3b) We generally concur that restrictions should be present only when absolutely necessary.

3c) In our view, there is already enough confusion in the world about the difference between global and local synchronization. They are very different things; combining them into a single sync statement will only serve to increase the confusion.

3d) We see no problem with allowing functions to have side effects. Rather than an IMPURE attribute that marks a function as potentially having side effects, however, we espouse a PURE attribute that is an explicit promise, made by the programmer, that a function is free of side effects.

3e) Fundamentally, we disagree with requiring MPI in addition to Fortran in order to have a complete programming model: [Coarray] Fortran should stand on its own. There is substantial utility in having a rich set of collectives, and compiler support for them can greatly ease the burden on the programmer (and reduce opportunities for error) when using them. For example, in the Rice CAF 2.0 implementation, we have built support in the compiler to automatically compute the sizes of data and to generate callback functions.

3f) Teams are needed for coupled codes and are very useful for linear algebra applications. Again, we disagree strongly with requiring MPI in addition to Fortran in order to have a complete programming model. This is particularly true when an all-coarray Fortran program could be aesthetically pleasing.

3g) We dislike notify and query because we strongly prefer first-class events. Instead of directly synchronizing with another processor, we find it a far better programming model to synchronize with an event that is logically connected to remote data. Further, events provide a safe synchronization space: if a library routine notifies one of its own events, that notification cannot be accidentally consumed by a waiting operation in user code, but with direct processor-to-processor synchronization the same cannot be said. Debugging synchronization errors of this form is slow, tedious, and painful.

3h) Rather than have an intrinsic isMyLock that is specific to locks, we propose extending imageof() from handling just copointers to also handling locks and events. However, we note that many implementations will wish to use a test-and-test-and-set lock, for which lock ownership information is not normally stored with the lock. Rather than isLocked(), we would suggest adding a trylock() function that attempts to acquire a lock if it is unlocked and fails otherwise. Programmers should not write their own spin loops. Locks can implement their own spins, including spin-then-yield code as appropriate. This gains efficiency since no traversal of data structures is necessary to find the memory location to spin on. On the subject of locks, we note that formal locksets allow multi-lock acquisition to occur in a canonical order; this provides a degree of safety against cyclic deadlock in multi-lock codes. A very simple canonical order would be the addresses of the lock variables.
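
The following sketch shows how such a trylock() might be used so that an image makes progress instead of spinning. The trylock() interface and the helper routine are illustrative only; lock_type is the existing Fortran 2008 type.

      use, intrinsic :: iso_fortran_env, only: lock_type
      type(lock_type) :: work_lock[*]
      integer :: owner
      logical :: got_it

      owner = 1                            ! image holding the shared resource
      do
        got_it = trylock(work_lock[owner]) ! acquire if free, fail otherwise
        if (got_it) exit
        call do_other_useful_work()        ! hypothetical helper: make progress
      end do
      ! ... critical section guarded by work_lock ...
      unlock(work_lock[owner])

Any spinning or spin-then-yield policy stays inside the lock implementation; the program only states what to do while the lock is unavailable.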

3i) While compatibility is useful, we reiterate our stance that Fortran 2008 should stand on its own. For example, the compiler can generate multithreaded or CUDA code from a do concurrent loop. Requiring CUDA + OpenMP + MPI + CAF is far less aesthetically appealing than an all-CAF solution.

Proposal 4.
-----------

This proposal is subsumed by our approach to copointers, the details of which appear in Appendix II. In particular, we observe that adding the cotarget attribute to a non-coarray variable makes it a coscalar by requiring that it be allocated in the shared memory space. We see no need for the relocate() statement nor for the image= qualifier to the allocate statement. Functionality equivalent to relocate() can be achieved by simply reallocating the scalar and copying data from the old location to the new one. Functionality equivalent to the image= qualifier can be achieved by placing a conditional around the allocation statement:

      if (mype .eq. 4) then
        allocate(foo)
      endif

We note that for caching purposes, it suffices to copy a coscalar to a local variable. In general, the heap is not symmetric; providing optimizations based on an assumption otherwise seems ill-advised.

Proposal 5.
-----------

We are in full agreement that asynchronous collective operations are useful and desirable. In fact, we have used them to good effect in developing Rice CAF 2.0 implementations of the High Performance Computing Challenge (HPCC) benchmarks [2]. Rice CAF 2.0 supports two variants of asynchrony for collectives. In the explicit model, an event variable is supplied as a parameter to the collective. Upon completion of the collective operation, the event is notified. This allows the programmer to determine when the collective operation has completed so that subsequent code, predicated on completion of the collective, may be executed.

      co_sum_async(some_coarray, some_event)  ! kick off a reduction
      ...                                     ! overlap computation with it
      event_wait(some_event)                  ! ensure it has completed

In contrast, in the implicit model, the programmer omits the event variable and instead calls an explicit "cofence" to be sure that all pending operations have completed:

      co_sum_async(some_coarray)  ! kick off an asynchronous reduction
      ...                         ! overlap computation with the reduction
      cofence                     ! ensure it has completed

For more details on the cofence, see Appendix I. In addition to collectives, we have found substantial benefit in supporting two other asynchronous functions:

(1) An asynchronous barrier offers the same functionality as a split-phase barrier. Triggering the barrier is equivalent to a notify, and waiting on the event (or blocking with a cofence) is equivalent to awaiting completion of the barrier.

(2) A predicated asynchronous copy allows data to be transferred to/from a remote image as soon as it is ready, and automatically notifies an event when the copy has completed. This is useful, for example, in a scenario where we have initialization to perform and need data from a partner:

      copy_async(my_buffer, remote_buffer[partner], pred_event, &
                 data_copied_event)
      ...  ! perform other initialization while waiting for the data
      event_wait(data_copied_event)  ! make sure we have the data
      ! proceed with computation

Here, we have overlapped the computation of our initialization with the communication of data into my_buffer from the partner's remote_buffer.

Proposal 6.
-----------

Issue A) We believe that this is a non-issue. The allocation of coarray 'a' on team c would overwrite the pointer to 'a' on overlapping members; there can be only one 'a' on any image. This would of course be a programming error that could be checked at runtime when trying to allocate an already-allocated pointer. In the Rice CAF 2.0 implementation, coarrays are registered after allocation; the name-duplication conflict would manifest at this stage if it had not previously been detected.

Issue B) As detailed in Tony Skjellum's rationale for MPI libraries [3], reindexing is crucial if support libraries are to be developed. We agree that having ranks > 1 poses several logistical problems from a language viewpoint. This is precisely why we oppose having more than one rank for codimensions. However, to provide the functionality of multiple dimensions, we support topologies. In particular, with a Cartesian topology, one can write code that appears to index multiple ranks. The indexing is reduced by the topology to a linearized one-dimensional index into the single physical rank for the coarray. This resolves the issues described here.
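
A minimal sketch of the idea follows; the way the topology is expressed here (plain integer arithmetic) is illustrative only, and is neither the Rice CAF 2.0 nor any proposed standard syntax, but it shows what the linearization does.

      integer :: a[*]                  ! single physical codimension
      integer :: nrows, ncols, row, col, linear_image

      nrows = 4
      ncols = num_images() / nrows     ! assume num_images() is a multiple of 4

      ! A Cartesian topology is a bijection between logical (row, col)
      ! coordinates and the linear image index 1..num_images().
      row = 2
      col = 3
      linear_image = (row - 1) * ncols + col

      a[linear_image] = 42             ! looks 2-D to the user, 1-D to the coarray

A library layered on such a topology can offer its users multi-dimensional co-indexing while every coarray underneath keeps a single codimension.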

Issue C) We believe that teams are very useful for many applications, including coupled codes and linear algebra applications, to name two. We urge the committee not to remove them from the Fortran 2008 specification.

Proposal 7.
-----------

This proposal is subsumed by our approach to copointers, the details of which appear in Appendix II.

Proposal 8.
-----------

We agree that this proposal has appeal. In fact, an early version of our CAF 2.0 implementation supported asymmetric coarrays. But when we tried it, it caused chaos with the reshaping of arrays. Suppose, for example, that we have 2D arrays of different sizes. Now suppose we pass column 3 to a local subroutine, which then tries to access that column on another image. We see no reasonable way to handle the case where that column does not exist on the remote image. Further, even if the remote image *does* have a third column in the coarray, what if the columns are of differing lengths? The subroutine has no good way to know the bounds of the column on the remote image. A semantic problem occurs when we attempt to access the entire column (via a ':' operator): does the colon refer to the local or the remote bounds? For all of these reasons, we dropped support for asymmetric coarrays from our CAF 2.0 compiler.

Proposal 9.
-----------

We note that when reading a standard, it is useful to have names that are logically associated appear near each other in the standard, including in the index and in a table of intrinsics. For this reason, we have adopted event_wait and event_notify in the Rice CAF 2.0 implementation.

9.1) As detailed in our memory model notes (see Appendix I), notify should be a "release" operation. Coarray operations that appear after a notify may execute before the notify, but no coarray operation before a notify should execute after it. This is needed to make events reasonable: if a programmer writes to a remote coarray and then performs a notify to signal that the write has completed, the write had better not be delayed until after the notification! Similarly, query should be an "acquire" operation (with the antisymmetric dependences). In general, the semantics of notify should be non-blocking. Notification should occur after the communication completes, but there is no need to block the caller until that time. Blocking would just make it harder to overlap communication latency with computation, which is crucial for extracting maximum performance in HPC environments.

9.2) It seems strange to separate the image number and the event name when they could be combined into a single parameter. For example, the second statement below seems far more intuitive and in keeping with existing coarray syntax:

      notify(3, some_event(i))     ! As proposed
      notify(some_event(i)[3])     ! Implemented in Rice CAF 2.0

9.3) We disagree with restricting the number of outstanding notifies to one. For example, a bounded-buffer implementation could take advantage of -- and would require -- higher limits.

9.4) We note that image numbers should be relative to a team. For example, in the following call, j is relative to the team some_team, not an absolute image number:

      notify(some_event(i)[j@some_team])

9.5) Please do not conflate notify and query with (asynchronous) barriers. Point-to-point and collective operations should be kept separate.

On the subject of events, and similar to the locksets we proposed earlier in this document, we propose eventsets. As implemented in Rice CAF 2.0, these collections of events offer programmers the following convenient functionality:

      notifyall:   perform a notify on each member event
      waitall:     wait for each member event to be notified
      waitany:     wait for one member event to be notified, similar to the
                   socket library's select() mechanism
      waitanyfair: wait for notification on one of the member events that
                   has received the fewest notifications

Since it may not be obvious, the intent behind waitanyfair is that by calling it in a loop exactly N times, where N is the cardinality of the eventset, a notification of each component event is guaranteed to have been consumed exactly once by the time the loop terminates.

References
----------

[1] Hans-J. Boehm and Lawrence Crowl. C++ Atomic Types and Operations. ISO/IEC JTC1/SC22/WG21 N2427 = 07-0297, 2007-10-03. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html.

[2] HPC Challenge benchmark. http://icl.cs.utk.edu/hpcc.

[3] A. Skjellum, N. E. Doss, and P. V. Bangalore. Writing libraries in MPI. In A. Skjellum and D. S. Reese, editors, Proceedings of the Scalable Parallel Libraries Conference, pages 166–173. IEEE Computer Society Press, October 1993.

Appendix I: Commentary on the Fortran 2008 Memory Model
-------------------------------------------------------

In this section, we present our views on the memory model described in the draft Fortran 2008 standard. Although the memory model is not formally described in the F2008 draft, our views are based on information mined from it, especially Section 8.5 (Image Execution Control).

Comment #1: The draft standard does not define the consistency requirements within a segment. We recommend processor consistency for coarray reads/writes within a segment. The absence of any form of consistency within a segment allows aggressive compiler/hardware reorderings; this forces the programmer to introduce numerous memory fences for correctness, which in turn makes the code harder to optimize.

Comment #2: The current memory model makes for a difficult programming model. It takes a "performance-first" approach, allowing aggressive compiler/hardware optimizations that reorder operations within a segment, or between segments that are not ordered via image control constructs. As a result, programmers have to use sync_memory to avoid subtle race conditions in many places, especially when asynchronous operations are employed. We believe that the average programmer should not have to learn the intricacies of the memory model (such as needing to use sync_memory) in order to write correct code.
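
A small example of the burden follows. The notify/event syntax is the Rice CAF 2.0 form used earlier in this document, and the variable names are illustrative; the point is the fence the current model forces the programmer to write.

      x(1:n)[p] = buf(1:n)       ! put data to image p
      sync memory                ! required today: keep the put from being
                                 ! reordered past the signal below
      notify(ready_event[p])     ! tell image p the data is there

With the release semantics for notify that we advocate in our comments on Proposal 9, the explicit sync memory above would be unnecessary.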

Comment #3: The current memory model lacks predicated fences. We believe that predicated fences, such as our cofence, are a necessary addition. The current memory fence (sync_memory) is not sufficiently flexible to provide the performance tuning that advanced programmers need: as currently described, sync_memory acts as a barrier for all memory and coarray operations. However, advanced programmers need constructs that separately capture the local and global completion of coarray operations, especially asynchronous ones. The cofence construct allows programmers to control the local completion of put, get, and implicitly synchronized asynchronous operations. The cofence API is as follows:

      cofence({DOWNWARD=PUT/GET/PUT_GET}, {UPWARD=PUT/GET/PUT_GET})

Cofence takes two optional arguments. The first specifies which categories of implicitly synchronized asynchronous operations (puts, gets, or both) are allowed to move downward across the cofence; the second specifies which categories are allowed to move upward. Depending on the argument values passed, the cofence allows puts, gets, or both to pass across it in the specified direction.

Let us consider a collective asynchronous broadcast operation to understand the use of cofences in tuning performance.

      ! process p is performing a broadcast
      broadcast_async(buffer, p)
      cofence(DOWNWARD=GET, UPWARD=PUT_GET)
      ! after the cofence, buffer can be safely overwritten
      buffer = ...
      ! wait for global completion of the broadcast
      cofence

In the code sample above, process p is performing an asynchronous broadcast. Once p sends the broadcast data to its children (i.e., the broadcast is locally complete in p), p does not need to participate in the remainder of the broadcast. Process p can thus overlap useful work, such as preparing the buffer for the next iteration, with waiting for the broadcast to complete. While capturing this local completion, it is efficient to allow other "get" operations to be performed later (to pass downward across the cofence) and "put"/"get" operations to be performed earlier (to pass upward); a full memory barrier would not allow these efficiencies. The broadcast is globally complete when all participating processes have obtained the broadcast data. Global completion matters to process p if activities after the broadcast in p depend, directly or transitively, on the assumption that the other processes have received the broadcast data.

Comment #4: It is not clearly stated (but it is implied) that functions should not have side effects. This should be clarified in the standard.

Appendix II: Copointers in Rice CAF 2.0
---------------------------------------

CAF 2.0 adds global pointers to the Fortran language in support of irregular data decompositions, distributed linked data structures, and parallel model coupling. The definition and use of these new "copointers" is as similar as possible to that of ordinary Fortran pointers: they are declared with new attributes analogous to 'pointer' and 'target', manipulated with the existing '=>' pointer assignment statement, and inspected with the existing pointer intrinsics. Accessing data via copointers is as similar as possible to existing coarray accesses, with implicit access to the local image and explicit access to remote images using a square-bracket notation. CAF 2.0's copointers may point to values of any type, including coarrays; we believe that copointers to coarrays will be especially valuable for parallel model coupling in systems like the Community Earth System Model. Copointers can be implemented easily and efficiently in existing CAF compilers; we have already begun adding them to our prototype CAF 2.0 compiler.

The rest of this note explains the copointer concept in more detail, then describes how copointers are declared, created, copied, dereferenced, and inspected. It closes by mentioning a few nonobvious semantic details and sketching an implementation strategy. The approach here is tutorial rather than formal, and the terminology is for the most part programmer-oriented rather than compatible with the Fortran standard documents. For instance, we usually say "variable" rather than the standard's "entity" and "points to" rather than "is associated with". But not always!

COPOINTERS AND COTARGETS

Copointers are typed "global pointers" which can point to storage on any processor ("image") in a parallel computer. Each copointer points to a specific typed block of storage (a Fortran "entity") allocated on a specific image. Despite the "co" in their name, copointers are not distributed across images like coarrays; each copointer is a small scalar value residing on a single image. Apart from their global reach, the semantics of copointers are nearly identical to the semantics of ordinary Fortran pointers: copointer variables and copointer components of derived types may be declared, set to point to other entities, copied, dereferenced, sectioned via subscripting to yield copointers to subentities, and examined via the existing Fortran pointer intrinsics. It may be helpful to think of a copointer as a pair (i, p), where 'i' is an image number and 'p' is an ordinary Fortran pointer valid on image 'i', although the implementation may be different.

Cotargets are entities which may become the destination of a copointer. Such entities must be declared with the 'cotarget' attribute, just as potential destinations of ordinary pointers must be declared with the 'target' attribute. If a CAF 2.0 implementation relies on special "shared memory" regions for efficient communication between images, then it will allocate entities with the 'cotarget' attribute in such a region. Cotarget entities are in all other respects ordinary entities and may be used locally without restriction.

Copointer values may be freely copied, even from one image to another, and each new copy points to the same specific storage block on the same specific image as the original copointer. Creating and copying copointers are cheap, purely local operations. So is dereferencing a copointer that happens to point to the image doing the dereferencing. Dereferencing a copointer which points to a different image requires the same sort of communication as a corresponding off-image coarray reference.
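
As an illustration of why this matters for distributed linked data structures (one of the motivations above), the node of a distributed linked list can carry a copointer to a successor that may live on any image. The type below is an illustrative sketch using the declaration syntax described in the next section; the names are ours.

      type :: node
        integer               :: payload
        type(node), copointer :: next    ! may point to a node on any image
      end type node

      type(node), cotarget :: head       ! eligible to be the target of a copointer

Traversing such a list simply follows 'next' copointers from image to image, which is awkward at best to express with symmetric coarrays alone.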

DECLARING COPOINTERS AND COTARGETS

Copointer and cotarget entities are declared with the usual Fortran declaration syntax augmented with the new 'copointer' and 'cotarget' attributes. For instance, to declare an integer array and a copointer which can point to it, we write

      integer, dimension(10), cotarget :: a1
      integer, dimension(:), copointer :: p1

This makes 'a1' an array of 10 integers allocated in shared memory and 'p1' a copointer variable of compatible type. Copointers may point to entities of any type, subject to the limitations of Fortran's attribute syntax as explained below. In particular, CAF 2.0 allows copointers to coarrays, providing an expressive and efficient mechanism for model coupling in large parallel codes.

The 'copointer' and 'cotarget' attributes may be combined with other Fortran attributes just as 'pointer' and 'target' may be. For instance,

      type(t), dimension(:,:), save, contiguous, copointer :: p2

declares a copointer entity 'p2' which points to two-dimensional arrays of elements of derived type 't', which retains its association across subprogram invocations, and which can only be associated with contiguous cotarget arrays.

Declaring cotargets needs no further explanation. To describe how copointer types are declared, we must first consider a key syntactic feature of Fortran's existing type declarations: namely, that the textual order in which an entity's attributes are given is insignificant. This feature both resolves potential ambiguities and limits the set of data types which can be expressed. For instance, both of the following declarations specify the type "pointer to array of integer":

      integer, pointer, dimension(:) :: p3
      integer, dimension(:), pointer :: p4

Since the order of appearance of 'pointer' and 'dimension' does not matter, the ambiguity in interpretation is resolved by a rule we can write as "pointer < dimension"; that is, 'pointer' has lower syntactic priority than 'dimension' and so is applied later during type formation, giving "pointer(dimension1(integer))" as the specified type. Because of this rule, there is no way to express the type "array of pointer to integer" in Fortran. However, the missing type can be simulated by wrapping a pointer in a derived type:

      type :: t
        integer, pointer :: p
      end type
      type(t), dimension(:), allocatable :: a2

      ! initialize a2 ...
      a2(1)%p = 0

We can now describe the precise syntactic interpretation of 'copointer' in CAF 2.0 by the following rules:

      pointer < copointer < codimension < dimension

These precedence relations are consistent with the existing syntax of Fortran 2008 and give an unambiguous interpretation of every possible combination of these four attributes in a type declaration. For instance, both of the following declarations specify a copointer to a coarray of corank 1, rank 2, and element type integer:

      integer, dimension(:,:), codimension[:], copointer :: p5
      integer, copointer :: p6(:,:)[*]

In each declaration, the three attributes 'copointer', 'codimension', and 'dimension' occur and are interpreted in that order to give the type "copointer(codimension1(dimension1(integer)))". Like Fortran 2008's, CAF 2.0's attribute interpretation rules resolve ambiguity at the cost of limiting the set of types which can be directly expressed. For instance, "array of copointer" cannot be expressed directly but can be simulated with derived types just as shown above for "array of pointer".
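
Spelled out, the simulation of "array of copointer" looks just like the pointer case. The wrapper type below is an illustrative sketch (the names are ours), following the declarations above.

      type :: cp_box
        integer, dimension(:), copointer :: p   ! one boxed copointer
      end type cp_box

      type(cp_box), dimension(:), allocatable :: table

      allocate(table(num_images()))
      ! table(i)%p can now be associated with data on image i, giving the
      ! effect of an array of copointers indexed by image number.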

CREATING AND COPYING COPOINTERS

Copointers are created and copied via Fortran's existing 'allocate' and pointer assignment statements, in the same way as ordinary pointers. There are four cases to consider.

(1) A copointer is created when an 'allocate' statement is executed with a copointer variable as its argument. The allocated storage comes from the current image's shared memory region so that it can be accessed from any other image. A copointer to that storage is created and stored in the argument variable.

(2) A copointer is created when a pointer assignment statement's right-hand side (RHS) is a plain data reference; a new copointer to the RHS is assigned to the variable on the left-hand side (LHS). (In Fortran terminology, the LHS entity "becomes copointer associated with" the RHS data reference.) The RHS must have the 'cotarget' attribute. The RHS may be either a reference to local data on this image or a reference to remote data on another image; in either case, a copointer is created which points to the RHS data. Of course, for an RHS to reference remote data it must be a coarray reference or a copointer-dereference expression (next section). For instance, the following two statements both create copointers, one pointing to a local array and one pointing to an array on another image:

      integer, dimension(:), copointer :: p7, p8   ! copointers to array of integer
      integer, dimension(10), cotarget :: a3[*]    ! coarray of array of integer

      p7 => a3      ! copointer to a3's local array
      p8 => a3[9]   ! copointer to a3 on image 9

(3) When a pointer assignment statement's RHS is an ordinary (i.e., local) pointer, the local pointer cannot be copied as-is into the LHS because its type is not correct. Instead, the pointer is converted into a copointer and assigned to the LHS; this is a form of copointer creation. For instance:

      integer, dimension(:), pointer :: r   ! pointer to array of integer

      r => a3       ! creates a pointer to a3's local array
      p7 => r       ! converts the local pointer to a copointer

(4) A copointer is copied when a pointer assignment statement's RHS is already a copointer. Given the previous declarations, the following statement copies an existing copointer:

      p7 => p8

DEREFERENCING COPOINTERS

Copointers may be "dereferenced" to get a data reference that can be used in either RHS or LHS contexts. In general the data reference is remote, so loading from it and storing into it require communication with another image. For this reason, CAF 2.0 requires copointers to be explicitly dereferenced via a new "co-dereference operator" ([ ]) to indicate this communication cost in the source code. This is in contrast to Fortran's implicit dereferencing of ordinary pointers. For instance, the previously introduced variable 'p7' is a copointer to array of integer, so 'p7[ ]' is just an array of integer, and the following assignments copy integers and integer arrays between this image and some other image:

      integer :: k
      integer, dimension(10) :: a4

      k = p7[ ](1)
      a4 = p7[ ]
      p7[ ](1) = a4(1)
      p7[ ] = a4

For additional expressiveness, CAF 2.0 allows a copointer to be dereferenced implicitly when it is known that the copointer points to local data. This indicates in the source code that the dereference operation requires no communication. The result of an implicit dereference is undefined if the copointer points to another image. For instance, if the value of 'p7' is a copointer to this image, we can write:

      k = p7(1)
      a4 = p7
      p7(1) = a4(1)
      p7 = a4

COPOINTER INTRINSIC FUNCTIONS

CAF 2.0 extends the pointer-related intrinsic procedures of Fortran 2008 to work with copointers as well. For instance, 'associated(p7)' returns a logical indicating whether 'p7' is associated with a target, and 'p7 => null()' sets 'p7' to disassociated status. In addition, CAF 2.0 provides a new intrinsic function 'imageof' which returns the image number to which an associated copointer points; the result is undefined if the copointer is disassociated.
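
The intrinsics combine naturally with the dereference rules above. The fragment below (reusing 'p7' and 'a4' from the earlier examples) is an illustrative sketch, not prescribed usage:

      if (associated(p7)) then
        if (imageof(p7) == this_image()) then
          a4 = p7       ! local target: implicit dereference, no communication
        else
          a4 = p7[ ]    ! remote target: explicit co-dereference, communicates
        end if
      end if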

SEMANTIC DETAILS

Here are a few related details of CAF 2.0 semantics.

(1) A copointer value may be implicitly converted into an ordinary pointer when it is known that the copointer points to local data. The result is undefined if it points to another image. For instance, if the value of 'p7' is a copointer to this image, we can write:

      r => p7

(2) Fortran 2008 forbids associating an ordinary pointer with a remote data reference (a coindexed object, i.e., all or part of a coarray). Similarly, CAF 2.0 forbids associating an ordinary pointer with the result of dereferencing a copointer. Thus the following statement is incorrect:

      r => p7[ ]   ! not allowed, even though the RHS is type-compatible
                   ! with 'r' ("array of integer")

(3) As mentioned above, CAF 2.0 allows all possible combinations of the four type-determining attributes. In addition to our new attributes, this extends Fortran 2008's use of existing attributes by allowing "pointer to coarray". CAF 2.0 also eliminates Fortran 2008's restrictions on nesting coarrays and on embedding coarrays within arrays.

IMPLEMENTATION

CAF 2.0's copointers can be implemented easily and efficiently, so that dereferencing a copointer is no more expensive than a corresponding coarray reference, and typically cheaper. To add copointers to a compiler which already implements coarrays, one has only to factor the code generation for a coarray reference into two parts: a generalized address calculation to determine which bytes are needed from which image, followed by a communication operation to obtain those bytes across the interconnect. Then the code for a copointer dereference is just the communication code, because a copointer's representation essentially caches the result of an address calculation.

Specifically, our prototype CAF 2.0 compiler represents a copointer value as a pair (i, p), where 'i' is an image number and 'p' is an ordinary Fortran pointer valid on image 'i'. Our prototype dereferences a copointer to remote storage by sending its pointer 'p' to the image 'i' that created it, dereferencing the pointer normally on 'i', and receiving the fetched bytes in reply. This is about the same communication cost as a corresponding coarray reference. On a machine whose interconnect hardware supports one-sided communication, the CAF 2.0 runtime could instead decode 'p' and use the corresponding addresses, strides, and lengths to initiate low-level hardware communication directly.

Our prototype's representation does make an assumption about the underlying Fortran compiler's storage allocator: the allocator must tolerate our copying and storing of pointers beyond its reach. For instance, the allocator must not do reference counting, garbage collection, or storage compaction by moving blocks and updating pointers, because the allocator cannot see our copies of pointers on other images. The Fortran language does not require any of this, and in fact all commonly used Fortran compilers satisfy our assumption. However, a simple change of representation would permit implementing CAF 2.0 on an allocator which does not satisfy the assumption: the pointer component 'p' is replaced by an opaque handle 'h' which can be looked up on image 'i' to yield the corresponding pointer. Instead of sending 'p' to the remote image, one would send 'h' at the same cost, and the rest of the implementation would be unchanged.
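
One possible concrete shape for these two representations is sketched below. This is an illustration of the idea only, not the layout our prototype actually uses; the type and component names are ours.

      use, intrinsic :: iso_c_binding, only: c_ptr

      ! Pointer-based representation: the (i, p) pair described above.
      type :: copointer_rep
        integer     :: image    ! the 'i': owning image number
        type(c_ptr) :: addr     ! the 'p': an address meaningful only on 'image'
      end type copointer_rep

      ! Handle-based variant for an allocator that may move storage:
      ! 'handle' is looked up in a per-image table to recover the address.
      type :: copointer_rep_handle
        integer :: image
        integer :: handle
      end type copointer_rep_handle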