ISO/IEC JTC1/SC22/WG5-N1745

VOLATILE Coarrays
-----------------

Nick Maclaren on behalf of the UK panel, 14th October 2008.


0. Introduction
---------------

The concept of 'volatile' objects, as used in C, Fortran etc., has
always been a problem for language semantics, because it introduces
the concept of objects changing value without explicit action by the
program.  Fortran VOLATILE coarrays overload it with another concept,
that of parallel atomicity - i.e. that a value can be changed from one
value to another by one image, and either the new or the old value is
seen by all other images (without an intervening period of being
undefined).  There are several major problems with them.

There are major specification problems with VOLATILE coarrays, which
are discussed below; some are sufficiently serious that resolving them
would mean pervasive changes, which would probably be regarded as too
restrictive to be acceptable.  Even ignoring those, the specification
is too imprecise to know what effects Fortran requires a processor to
deliver to the programmer, and it is therefore almost impossible for a
programmer to write code that is reliably portable.

Experience with similar parallel interfaces shows that few (if any)
ordinary programmers can use volatile objects correctly, and even
experts have difficulty, especially when the specifications are
imprecise.

The introduction of VOLATILE coarrays also means that many existing,
widespread, important serial optimisations cannot be performed without
changing the results, even on code that makes no use of either
coarrays or VOLATILE.  Downgrading the optimisation of serial code
from that which is possible in Fortran 2003 will be unacceptable to
many people.

Lastly, it is unclear whether VOLATILE coarrays can be implemented
with an acceptable degree of efficiency on the now ubiquitous
'commodity clusters'[*].
For these reasons, we feel that VOLATILE coarrays should be removed
from the Fortran standard, and more appropriate (higher-level)
mechanisms included (possibly after more implementation experience).

[*] The term 'commodity cluster' refers to a collection of
off-the-shelf workstations or small servers, connected by TCP/IP and
Ethernet (or possibly InfiniBand), and running some widely-available
operating system (such as a Linux or Unix variant or Microsoft
system).  On such systems, the compilers, language run-time systems
and applications libraries are often written by separate
organisations, and always run without 'system privileges' or
operating system extensions.


1. A Better Approach
--------------------

The major use of volatile data objects in parallelism, in the
languages that have them, is by experts for writing signal handling
and synchronisation primitives.  A second use is for essentially
trivial tasks, such as setting and testing a single global flag
variable or writing a simple parallel reduction.

Fortran is a high-level language, and the cleanest solution would be
to remove VOLATILE coarrays, thus eliminating all the problems they
cause, and to specify the high-level parallelism primitives directly.
These need not be standardised immediately, which would give time to
design them properly, and to obtain experience with implementation and
use.  This paper does not make any proposal for such primitives, but
the following is a description of the sort that are envisaged:

 1) Locks, mutexes, semaphores etc.  Exactly which of these should be
    specified is a matter of taste, but most experience is that simple
    uses can be implemented with any of them.  Paper J3/08-256 makes a
    proposal for locks.

 2) Explicitly atomic datatypes and operations, including global flag
    setting, compare-and-swap etc.
    Separating these from 'normal' Fortran datatypes and operations
    means that the semantic problems described below can be bypassed,
    and makes their implementation a lot easier.

 3) Global reductions (e.g. summation over images).  These have the
    property that the final value does not become visible until some
    appropriate synchronisation is performed, and have similar
    semantic and implementation advantages to explicitly atomic
    actions.

These would provide the facilities that real users need, at a level
that they might manage to use correctly.


2. Specification Issues
-----------------------

In parallel languages that have similar volatile object semantics,
even experts have great difficulty using volatile objects to implement
synchronisation primitives unless they keep their code very simple.
Experience is that it is too hard for most ordinary programmers, and
they usually make serious mistakes by assuming more synchronisation
than is actually specified.

A great many of these problems are caused by imprecise specifications;
these lead to each vendor providing subtly different semantics for
volatile data objects, which causes even well-tested programs written
by experienced users to fail unpredictably, especially when ported to
new systems or when there is a new version of the compiler.

There are several major specification problems with VOLATILE coarrays,
which fall into two classes:

 1) Exactly what is allowed.  Some of the examples given here are
    simple oversights and could be resolved by wording alone, but
    others are not so easy.  Fortran, like most other languages,
    specifies the language largely by imposing constraints on what a
    conforming program may do.  This issue is less about what may be
    done than about exactly what effects conforming actions have; that
    is often not specified.  In some cases of VOLATILE coarrays, the
    effects are almost unspecifiable.
    Note that parallel memory models are much more complicated than
    serial ones, because parallelism exposes issues that are hidden in
    serial languages (except in asynchronous signal handling, which
    Fortran does not have).

 2) The exact effect of actions on VOLATILE coarrays as seen by other
    images, and the interactions of VOLATILE coarray accesses with
    segments.  This is essentially unspecified, and there are some
    serious ambiguities.

Allowing VOLATILE coarrays requires at least a specification of the
granularity of accesses and of the parallel memory model that applies
to them (if not sequential consistency), and some examples of the
issues are given here.  The problem with providing examples is that
simple ones are always unrealistic, and every simple problem can be
resolved by an extra constraint.  Actual experience of shared memory
programs is that the problems arise in code that looks simple but is
very hard to analyse.  As Lamport observed, there is no way to solve
the problem properly except by defining a proper memory model.


2.1 Lack of Safety
------------------

We revisit the example in N1744, "Coarrays and Memory Models", to
illustrate the unpredictable behaviour that is possible with VOLATILE
coarrays.

    PROGRAM Memory_Model_1
        INTEGER, VOLATILE :: one[*] = 0, two[*] = 0
        INTEGER :: p, q
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                one[8] = 123
            CASE(2)
                two[9] = 456
            CASE(3)
                p = one[8]
                q = two[9]
                WRITE (3,*) p, q
            CASE(4)
                q = two[9]
                p = one[8]
                WRITE (4,*) p, q
        END SELECT
    END PROGRAM Memory_Model_1

There is no requirement for images 1 and 2 to check that the new
values have reached images 8 and 9 until after executing SYNC ALL.
Hence the value of one[8] accessed by images 3 and 4 may be either 0
or 123.  Similarly, the value of two[9] may be either 0 or 456.
Furthermore, the combination '123 0' on unit 3 and '0 456' on unit 4
can occur if image 3 has better communication with image 9 than with
image 8, but image 4 has better communication with image 8 than with
image 9.
In fact, all combinations of '0 0', '123 0', '0 456' and '123 456' are
possible, and the result can vary from run to run.  Some combinations
may occur quite rarely, making unexpected results occur in code that
was thought to be tested.  Note that this example is the simplest that
shows the issue; more complex, but still realistic, examples are
available from the author.


2.2 Varying the VOLATILE Attribute of a Coarray Between Scopes
--------------------------------------------------------------

A very simple example of this is:

    PROGRAM Memory_Model_3
        INTEGER :: one[*] = 0, two[*] = 0
        INTEGER :: p, q
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                one[8] = 123
            CASE(2)
                two[9] = 456
            CASE(3)
                p = Get(one,8)
                q = Get(two,9)
                WRITE (3,*) p, q
            CASE(4)
                q = Get(two,9)
                p = Get(one,8)
                WRITE (4,*) p, q
        END SELECT
    CONTAINS
        FUNCTION Get (z, n)
            INTEGER, VOLATILE :: z[*]
            INTEGER :: Get, n
            Get = z[n]
        END FUNCTION Get
    END PROGRAM Memory_Model_3

The question here is whether this changes anything from the previous
example.  The above code seems to meet the liberty allowed in 8.5.1
Image control statements, paragraph 6:

    A coarray that is default integer, default logical or default
    real, and which has the VOLATILE attribute may be referenced
    during the execution of a segment that is unordered relative to
    one in which the coarray is defined.  Otherwise: ...

This sort of problem could be resolved only by requiring a coarray to
have the VOLATILE attribute in all scoping units if it has it in any
of them.
An even nastier example is the following, and it is so nasty that most
compilers reject it as invalid (though it seems to be valid Fortran
2003):

    MODULE Global
        INTEGER, SAVE :: Matthew[*] = 1
    END MODULE Global

    PROGRAM Boggle
        USE Global
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                CALL John()
            CASE(2)
                CALL James()
        END SELECT
        PRINT *, Matthew[9]
    END PROGRAM Boggle

    SUBROUTINE John
        USE Global
        VOLATILE :: Matthew
        PRINT *, Matthew[9]
    END SUBROUTINE John

    SUBROUTINE James
        USE Global
        Matthew[9] = 2
    END SUBROUTINE James

There are some systems where that is effectively implementable only by
providing VOLATILE coarray semantics for all coarrays, with the
consequent loss of efficiency.


2.3 Composite Objects
---------------------

Consider the following program:

    PROGRAM Composite_1
        INTEGER, VOLATILE :: value(100)[*] = 0
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                value(:)[9] = 123
            CASE(2)
                value(:)[9] = 456
        END SELECT
        SYNC ALL
        IF (THIS_IMAGE() == 9) PRINT *, value
    END PROGRAM Composite_1

An array is an object, and 'value' has type INTEGER, so many users
will assume that the elements of 'value' are either all 123 or all
456, but many implementations will deliver a mixture.  There is
nothing in the current wording that states or even implies which.

Another example is:

    PROGRAM Composite_2
        INTEGER, VOLATILE :: value(100)[*] = 123
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                value[9] = SUM(value[9])
            CASE(2)
                PRINT *, value[9]
        END SELECT
    END PROGRAM Composite_2

Is this required to print all values the same, or may some values be
123 and others 12300?  And, in either case, where is it specified?


2.4 Use in Protected Contexts
-----------------------------

There are several contexts where a variable may not be defined or
become undefined except in specific ways, but it is not clear in all
of them whether that covers the VOLATILE coarray case when the dubious
action is performed by another image.
For example, 8.1.7.6.2 paragraph 2 says "..., the DO variable may not
be redefined nor become undefined while the DO construct is active".
But consider the following program:

    PROGRAM Do_what
        INTEGER, VOLATILE :: a[*] = 1
        INTEGER :: b(100) = 0
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                READ (*,*)
                a[2] = 2
            CASE(2)
                DO a = 1,100
                    b(a) = a
                END DO
                WRITE (*,*) "Kilroy was here"
        END SELECT
    END PROGRAM Do_what

What does "while the DO construct is active" mean as applied to a
separate image?  Let us assume that the user did not type a newline
until he saw the message "Kilroy was here" appear, which would mean
that the DO construct had finished.  Would that make the above program
correct?

A variant question relates to the same program, but with the READ and
WRITE removed.  Would the correctness of the program depend on whether
the processor happened to execute image 2 before executing image 1?
And, if not, why would the answer differ from the previous one?

Note that this sort of issue causes a lot of trouble to users who test
their parallel code on workstations, and then run it on massively
parallel computers.  The former often run threads sequentially.
Resolving these issues would be time-consuming, as each case would
need finding and careful consideration.


2.5 Using WHERE and Masks
-------------------------

This may be an oversight, but does not seem to be forbidden.  However,
it is a very good example of how the lack of a precisely defined
memory model causes problems with the interaction of VOLATILE coarrays
and existing Fortran facilities.  Similar issues arise with vector
subscripts, FORALL and in several other constructions; even quite
reasonable programmers might well write elemental functions that
expose this sort of issue.  Hence simply forbidding such uses isn't a
simple task, and would need changes to many parts of the standard.
    PROGRAM Where_is_it_at
        INTEGER, VOLATILE :: a(1000)[*], b(1000)[*]
        INTEGER :: i
        IF (THIS_IMAGE() == 9) THEN
            DO i = 1,1000
                a(i) = MOD(17*17*17*i,1024)
                b(i) = MOD(19*19*19*i,1024)
            END DO
        END IF
        SYNC ALL
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                WHERE (MOD(a(:)[9],13) == 5) b(:)[9] = a(:)[9]
            CASE(2)
                WHERE (MOD(b(:)[9],13) == 5) a(:)[9] = b(:)[9]
        END SELECT
        SYNC ALL
        IF (THIS_IMAGE() == 9) PRINT *, a, b
    END PROGRAM Where_is_it_at

Fortran could either forbid this, or specify what it means, but the
current situation is that it is permitted without having even a
guessable meaning.


3. Reduced serial optimisation
------------------------------

In strict Fortran 2003, adding the VOLATILE attribute adds some
constraints, but no new semantics; extra semantics can be added only
by processor extensions (in this context, including relevant companion
processor support).  In particular, in a sequence of statements, no
object can change value unless the program defines or undefines some
identifier associated with it (possibly in a subprocedure).

However, VOLATILE coarrays can be changed at any time by other images,
without needing any processor extensions, and doing so is defined
behaviour.  A processor therefore needs to allow for this, and not use
any optimisations that would give incorrect results if it happens.  As
referencing VOLATILE coarrays is allowed even in PURE functions, this
has a major impact.

The example shown here is one where an array is initialised 'the wrong
way round'; several compilers currently optimise such things by
reversing the order of the loops.  It is also a case where common
subexpression elimination can save a lot of time; most compilers will
do that, even at low levels of optimisation.  It shows that the
introduction of VOLATILE coarrays means that neither optimisation may
be performed without changing the results, even though there is no use
of either coarrays or VOLATILE in the procedure being compiled.
This problem could be resolved only by making any reference to
VOLATILE coarrays in functions or PURE subroutines undefined
behaviour.  This would probably be regarded as unacceptable.


3.1 An Example
--------------

Consider the case of a Fortran processor that does not define any
semantics for VOLATILE beyond those required by Fortran 2003; that is
the usual case, and is likely to continue to be.  Now consider the
following external subroutine:

    SUBROUTINE Fred (arg, m, n)
        INTEGER, INTENT(IN) :: m, n
        INTEGER :: arg(m,n)
        INTERFACE
            PURE FUNCTION Joe (x)
                INTEGER :: Joe
                INTEGER, INTENT(IN) :: x
            END FUNCTION Joe
        END INTERFACE
        INTEGER :: i, j
        DO i = 1,m
            DO j = 1,n
                arg(i,j) = Joe(arg(i,j))+Joe(0)
            END DO
        END DO
    END SUBROUTINE Fred

Because Joe is marked PURE (and, strictly, even if it had not been),
no objects other than the array arg can become defined in that loop,
in Fortran 2003.  Hence the processor need evaluate Joe(0) only once,
and can arbitrarily reorder the loops for increased memory
performance; many existing compilers do either or both of those.
However, with VOLATILE coarrays, that is no longer possible.  Consider
the following module, program and function Joe.

    MODULE Global
        INTEGER, VOLATILE, SAVE :: Pete[*] = 1
    END MODULE Global

    PROGRAM Main
        USE Global
        INTERFACE
            SUBROUTINE Fred (arg, m, n)
                INTEGER, INTENT(IN) :: m, n
                INTEGER :: arg(m,n)
            END SUBROUTINE Fred
        END INTERFACE
        INTEGER :: array(1000,2000), i, j
        DO i = 1,1000
            DO j = 1,2000
                array(i,j) = 2000*i+j
            END DO
        END DO
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                CALL Fred(array,1000,2000)
            CASE(2)
                DO i = 1,1000000000
                    Pete[9] = i
                END DO
        END SELECT
        PRINT *, array
    END PROGRAM Main

    PURE FUNCTION Joe (x)
        USE Global
        INTEGER :: Joe
        INTEGER, INTENT(IN) :: x
        Joe = Pete[9]+x/3
    END FUNCTION Joe

Here, Joe references (not defines) Pete[9], image 1 calls Fred and
hence Joe, and image 2 defines Pete[9] in open code.  I can find
nothing in the standard that even discourages this.
The order in which Joe is called is now visible to the program, which
contradicts NOTE 8.30 and NOTE 12.51, and prevents some forms of the
above optimisations.  Note that there is no use of either coarrays or
VOLATILE in subroutine Fred; the introduction of VOLATILE coarrays has
therefore reduced the possibilities for optimisation even in code that
does not use either.


4. Behaviour on Commodity Clusters
----------------------------------

Consider the subroutine Refinement in a program fragment like the
following:

    MODULE Data
        INTEGER, VOLATILE :: table(1000)[*]
        INTERFACE
            ELEMENTAL LOGICAL FUNCTION Valid (value)
                INTEGER, INTENT(IN), VALUE :: value
            END FUNCTION Valid
        END INTERFACE
    END MODULE Data

    SUBROUTINE Refinement (index)
        USE Data
        INTEGER :: index(:,:), n
        DO n = 1,UBOUND(index,2)
            IF (index(1,n) > 0 .AND. &
                    .NOT. Valid(table(index(1,n))[index(2,n)])) &
                index(1,n) = -1
        END DO
    END SUBROUTINE Refinement

In general, a compiler cannot be sure that the VOLATILE coarray table
will not be updated by another image, and therefore will need to fetch
each value of 'table' sequentially.  There are better ways to write
this, but all simple versions have similar problems in the case where
'table' is too large to store on a single image.  Doubtless there are
better examples, too.

The issue here is how this sort of code can be implemented, and the
consequences of possible implementation approaches.  Nobody will
expect it to be as efficient as using local data, but the question is
whether it can be implemented reasonably portably and reasonably
efficiently on commodity clusters.  The requirement is for image A to
access data on image B while the latter is occupied doing something
else.  All of the implementation approaches known to the authors are
discussed separately below.

Note that certain implementation strategies can lead to deadlock, even
in programs that contain no deadlock in their logic; consider a
program like the following:

    PROGRAM Deadlock
        INTERFACE
            ! The initialisation of mutexes to an unlocked state is
            ! omitted for clarity; the companion processor is assumed
            ! to create and initialise at least mutexes indexed by
            ! arguments 8 and 9.  Otherwise, these interfaces are
            ! modelled on the POSIX calls pthread_mutex_lock and
            ! pthread_mutex_unlock.
            SUBROUTINE Mutex_lock (which) BIND(C)
                USE, INTRINSIC :: ISO_C_BINDING
                INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
            END SUBROUTINE Mutex_lock
            SUBROUTINE Mutex_unlock (which) BIND(C)
                USE, INTRINSIC :: ISO_C_BINDING
                INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
            END SUBROUTINE Mutex_unlock
        END INTERFACE
        INTEGER :: value[*] = 0, i
        IF (THIS_IMAGE() == 1) THEN
            CALL Mutex_lock(9)
        ELSE IF (THIS_IMAGE() == 3) THEN
            CALL Mutex_lock(8)
        END IF
        SYNC ALL
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                DO i = 1,CO_UBOUND(value)
                    value[i] = 123*i
                END DO
                SYNC MEMORY    ! One
                CALL Mutex_unlock(9)
            CASE(2)
                CALL Mutex_lock(9)
                SYNC MEMORY    ! Two
                DO i = 1,CO_UBOUND(value)
                    PRINT *, value[i]
                END DO
                CALL Mutex_unlock(8)
            CASE(3)
                CALL Mutex_lock(8)
        END SELECT
    END PROGRAM Deadlock

If the call to Mutex_lock in image 3 blocks, and coindexed objects
owned by it cannot be accessed by another image while it is in that
state, the above program will deadlock.  The Fortran processor
obviously has no control over the code of Mutex_lock and Mutex_unlock,
and so cannot prevent them from blocking.

In the following, the classification of each implementation strategy
refers to its viability for use on the ubiquitous commodity clusters.


4.1 Cache-coherent Shared Memory
--------------------------------

Currently, there are commodity systems that provide this for up to
about 16 cores (i.e. images), and a few specialist companies provide
it for up to about 1,000.  There are few problems with implementing
VOLATILE coarrays on such systems.
However, note that the specification problems remain, as each
architecture defines a slightly different set of guarantees; see, for
example:

    http://www.intel.com/products/processor/manuals/318147.pdf
    http://download.boulder.ibm.com/ibmdl/pub/software/dw/library/es-archpub2.zip
    http://www.sparc.org/standards/SPARCV9.pdf

There have been many attempts, over many decades, to provide
cache-coherent virtual shared memory over a cluster of separate
systems (i.e. with distributed memory at the hardware level).  None
have succeeded, and any claims that it will be delivered "real soon
now" are implausible.

This is not a viable implementation strategy.


4.2 Special Hardware and Operating System Support
-------------------------------------------------

Many specialist vendors (e.g. Cray) provide hardware or operating
system extensions that have the effect of letting an application on
one system access the memory of another, transparently - that is, so
that none of the applications on the latter need include any logic to
enable such access.  This is often called RDMA (Remote Direct Memory
Access).  Again, experience is that this works.

However, neither commodity hardware nor commodity software supports
any such mechanism, and so it cannot be used on commodity clusters.
The nearest to a commodity interconnect that has any such support is
InfiniBand, where it is generally believed that the protocol enables
such access.  Unfortunately, the specification is 2,000 pages long,
and the general belief may not actually be correct.  More importantly,
its prevalent software implementation for commodity clusters,
Openfabrics, has no such support.

The only interconnects that deliver the appropriate functionality are
vendors' own ones (e.g. Cray's) and Quadrics.  The latter can be
attached to some commodity clusters, but is expensive, specialist and
rare.

This is not a viable implementation strategy, until and unless
Openfabrics delivers such support.
4.3 MPI-2 One-Sided Communication
---------------------------------

On the face of it, this would appear to be a widely available
implementation of RDMA, but investigation shows that to be false.
Very few (if any) applications currently use MPI-2 one-sided
communication, and it is unclear how reliable, complete and efficient
its implementations are.

Even more seriously, the only 'true' one-sided mechanism in MPI-2 is
MPI_Win_lock (MPI-2 6.4.3), and MPI allows that to be restricted to
memory allocated by MPI_Alloc_mem, which may not be feasible for all
Fortran compilers on all systems.  Furthermore, MPI-2 11.7.2 states
that progress with a transfer is not required until the target process
(i.e. the image that owns the data) next reaches an MPI call, and a
transfer may therefore take an unbounded amount of time.  This would
lead to deadlock in some Fortran VOLATILE coarray programs, such as
the program Deadlock above.

Also, Fortran VOLATILE coarray accesses use a much lower granularity
than most MPI transfers, and it is unclear whether a viable MPI-2
implementation would be efficient enough for VOLATILE coarrays.

This is almost certainly not a viable implementation strategy.


4.4 'Cray SHMEM'
----------------

This is the message passing interface that originated on Cray, and was
copied to many other parallel systems; it is not the 'System V' shared
memory segment interface also called shmem.  The only call that would
help is SHMEM_Quiet (SHMEM_Fence and SHMEM_Wait affect actions on the
local node only, and SHMEM_Barrier is a collective).  It appears that
SHMEM_Quiet was introduced for the T3E.

There appears to be no implementation of SHMEM for distributed memory
systems that includes SHMEM_Quiet, except for Cray systems and the
specialist interconnect Quadrics.

This is not a viable implementation strategy.
4.5 Interrupting the Image to Complete the Transfer
---------------------------------------------------

In theory, the processor could use some form of interrupt mechanism to
trap transfers to the executing image (i.e. the one that owns the
data), handle the transfers and then continue processing.  The only
currently relevant mechanism is signals, and doing I/O in them is
undefined behaviour in C99 (7.14.1.1 The signal function, paragraph 5)
and POSIX (sigaction: APPLICATION USAGE, paragraph 3).  Many systems
provide extensions to POSIX in this area, but few are much of an
improvement, and most do not provide enough supported functionality to
implement message passing in an interrupt handler.

Experience with trying to use this mechanism on modern systems is that
it is, at best, hopelessly unreliable.  It could be made reliable only
by major changes to the kernel design (i.e. adopting the old mainframe
designs); that is implausible.

This is not a viable implementation strategy.


4.6 Using a Separate Thread to Handle the Transfers
---------------------------------------------------

A processor could require that there is at least one permanently
running thread dedicated to message passing per system, and that the
operating system provides coherent shared memory between that thread
and the image execution threads.  For performance, this also requires
one core dedicated to message passing, because otherwise the latency
of VOLATILE coarray accesses will be bounded below by the scheduler
interval (typically 10 milliseconds on modern systems, compared with
about 1 microsecond for Ethernet or InfiniBand messages).  A factor of
10,000 on the latency is a very serious performance hit.

It also requires the hardware and operating system to support a memory
update in one thread becoming visible to another thread,
expeditiously, with no action by the second thread.
That is undefined behaviour in POSIX (4.10 Memory Synchronization),
and is somewhat unreliable in practice, because of scheduling, memory
consistency and other problems.  Thread scheduling control is optional
in POSIX, and its semantics are largely implementation specific (POSIX
4.13 Scheduling Policy); it also does not address the memory
consistency issue.  Such mechanisms could be made reliable for many or
most systems only by non-trivial kernel enhancements.

This is perhaps the best implementation strategy for commodity
clusters, but its unreliability, system dependence and potentially
poor performance are serious problems.


4.7 Polling in Compiled Code
----------------------------

The Fortran processor can obviously insert checks for pending
transfers (i.e. poll for them) into the executed code, but needs to
put them inside all long-running loops and all potentially blocking
primitives (e.g. I/O statements).  That could reduce application
performance by a large factor, because it interferes with the
pipelining that is so important on almost all modern processors.  It
also does not address the problem of blocking in a companion
processor.

This is a viable implementation strategy only if performance and the
use of companion processors are not of major consequence.


4.8 Using the MPI and GASNet Progress Model
-------------------------------------------

MPI and GASNet (see section F.2 below) have the concept of a progress
engine, where MPI or GASNet calls check for and service all pending
actions, but no progress need occur between them.  Experience shows
that this is not what most shared memory programmers expect; it is
very hard to explain to them that what appears to be transparent
shared memory must actually be programmed like two-sided message
passing.
It is also arguably a major change of specification for Fortran
coarrays; whether or not it is, Fortran would need to define when
processors are required to progress, so that programmers can write
correct, reasonably portable programs.  On this topic, it should be
noted that the implementation of GASNet is such that UPC programs that
do not make expeditious progress actually fail, rather than merely
running inefficiently.

Specifying that explicit action is needed on the image that owns
coindexed objects to ensure progress (even if that image does not
access the relevant coarray) is a viable strategy, but needs
significant changes to the standard; also, many people will regard it
as unacceptable for VOLATILE coarrays.

This problem is much less serious for segments (i.e. non-VOLATILE
coarrays), because the higher granularity and the need for explicit
image control statements are likely to lead to a more collective
programming style.  The problem does not arise at all for SYNC ALL and
CRITICAL, of course, which are already collective.

This is probably not a viable implementation strategy for VOLATILE
coarrays, but probably is for segments.