Some comments on the recent WG5 papers on coarrays
===================================================

R. Bader, LRZ
November 6, 2008
ISO/IEC JTC1/SC22/WG5-N1756

A number of WG5 papers by Nick Maclaren (N1744, N1745, N1748, N1749, N1751)
contain example programs and a critique of the coarray concept as defined
within the present Fortran 2008 draft, in particular concerning VOLATILE
coarrays and communication with a passive image. This paper is an attempt
to understand some of the identified issues in the context of the draft
standard.


1. Prerequisites and Assumptions:
=================================

Assessment of the above papers will be based upon the interpretation of the
relevant parts of the draft Fortran 2008 standard performed in this section.

1.1 Definition of VOLATILE (5.3.19):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The VOLATILE attribute only refers to the possibility of external updates to
an object given it, not the manner in which this update is performed. In
particular, no atomicity of external memory updates on the level of the
object is guaranteed. This interpretation appears to be shared by MR&C,
Fortran 95/2003 Explained:

  "Even if only one process is writing to the variable and the Fortran
   program is reading from it, ... it is possible to read a partially
   updated ... value."

In the draft standard, Note 5.24 appears to be a bit misleading; a more
helpful formulation might be

  "The Fortran processor should use the most recently available state of a
   volatile object when a reference is performed by the processor. Likewise,
   it should make the most recent Fortran state available when a reference
   is performed by the external mechanism. It is the programmer’s
   responsibility to manage any interaction with non-Fortran processes,
   including the integrity of the referenced object."

For a VOLATILE object, the processor is expected to

* not register-optimize the object
* not move assignments from/to the object around during its optimization
  attempts

Any code segments involving VOLATILE objects can hence be expected to suffer
(considerable) performance degradation.

1.2 VOLATILE coarrays (8.5.1, para6):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an exception to the rules on definedness of coarray references, which
normally require an explicit image synchronization between accesses from
different images, it is possible to omit synchronization for VOLATILE
coarrays provided the referenced object is of type default real, default
integer or default logical. This restriction, according to N1747, was
imposed,

  " ... because memory updates/references to such a variable need to be
   atomic: referencing the value on one image concurrently with an update
   on another will either get the previous value or the new value. Such
   atomic memory operations cannot be guaranteed in general."

As a consequence, the object must additionally be a scalar, since the last
cited sentence would also apply to arrays of the above types. It may be
necessary to change the wording of 8.5.1, para6 to

  "A scalar coarray that is a default integer, default logical, or default
   real, and which has the VOLATILE attribute may be referenced during the
   execution of a segment that is unordered relative to the execution of a
   segment in which the coarray is defined."

to make this clearer. Furthermore (based on e-mail discussion with John
Reid) the following additional Note is suggested for 8.5.1:

-------------------------------------------------------------------------------
NOTE 8.29a
A scalar coarray that is volatile and of type default integer, default real,
or default logical is 'atomic' in the sense that read accesses from any
image other than those altering its value will obtain either the value
previous to an alteration, or the value after an alteration.
It remains the programmer's responsibility to prevent race conditions for
such volatile coarray references by suitable formulation of the algorithm
(ref. to example in Note 8.38).
-------------------------------------------------------------------------------

In any case, we have here an extension of the semantics of VOLATILE for the
indicated coarray objects, as compared to the original definition in 1.1.
This additional property is needed for the code in Note 8.38 to work. Note
that formally the VOLATILE attribute is only exploited on image Q in that
example. Due to the additional semantics, a reliable implementation for
commodity clusters will probably incur an even larger overhead than for
normal VOLATILE objects.


2. Comments on paper N1745:
===========================

In section 1, the author advocates three possibilities for a better
approach.

The first of these, locks, will probably be included in the final standard
anyway. However, while it is possible to implement the spin loop from
Note 8.38 in terms of locks, all attempts I've seen incur some additional
overhead, either due to pre-synchronization, or to potential lock
contention. The solution based on VOLATILE coarrays may hence still be the
most efficient one available. Locks will show their strengths in other
situations.

The second one, atomic datatypes and operations, is in effect what is
already there (see 1.2 above). One could of course consider introducing an
additional separate attribute, say ATOMIC, for such (and only such) an
object, and add special (generic) intrinsics for R, W, TAS, CAS etc. The
VOLATILE attribute would then still be advantageous since it provides the
effect of an object-specific SYNC MEMORY, saving on overhead if many other
memory operations are outstanding (an additional SYNC MEMORY would otherwise
be required within the spin loop of Note 8.38!). Furthermore, it appears
that anything beyond simple atomic read and write needs special hardware
support, so may be difficult for distributed memory systems anyway.
(Note added in writing: As of Nov 6, there exists a suggestion by Aleks
Donev (N1753) which provides a facility to completely decouple atomic
reads/writes from VOLATILE. This appears to completely solve the problem
from the standardization point of view, although efficiency on DMS may still
be under debate.)

The third one will be addressed in the coarray TR. I agree this is
especially important in the light of being able to map to reduction hardware
available in newer interconnects.

In section 2, some examples are provided to illustrate inconsistencies in
the standard.

Example 2.1 ("Lack of safety") correctly describes what will happen. It is,
after all, a program with a race condition, in other words, an ill-defined
parallel algorithm. Also note that if VOLATILE were removed as proposed, the
program would be non-conforming. There are then various ways to make it
conforming again, depending on what result one wishes to achieve.

Example 2.2 ("Varying the scope") appears to be addressed by J3/08-290. The
argument is reasonable, especially in the light of the additional atomic
semantics of VOLATILE coarrays. (I'm not sure whether the formulation in
J3/08-290 is sufficient to ensure that a VOLATILE coarray dummy argument is
rejected by the processor).

Example 2.3 ("Composite object") is non-conforming since the VOLATILE
coarray is not a scalar and hence the restrictions from 8.5.1 apply.

Example 2.4 ("Protected Context") is non-conforming since the cited
restriction is violated by the object in question being VOLATILE within the
scope of the DO loop. This would apply even if the CASE(1) statements were
not present.

Example 2.5 ("Where and Masks") is non-conforming since the VOLATILE coarray
is not a scalar and hence the restrictions from 8.5.1 apply.

In section 3, the following points are made:

* Using VOLATILE reduces serial optimization. This is true, and implies that
  users must be properly educated (as with the use and misuse of other
  language features).
* The example used to illustrate this uses a PURE function, which of course
  should be optimisable. This issue, an oversight, is addressed in
  J3/08-284.

In section 4 ("Behaviour on Commodity Clusters"), the example with the

  INTEGER, VOLATILE :: table(1000)[*]

has a high likelihood of being non-conforming since again the object is not
scalar, and there may be updates coming in from other images in an unordered
segment. This one would probably be a nice candidate for using locks, by the
way (after removing the VOLATILE attribute).

The example program "Deadlock" will be discussed in the comments on N1744
below.

I am not going through the list of implementation choices since large-scale
efficiency of VOLATILE coarrays is not the main point of having them.

Finally, the reference to the C++ standardization efforts with respect to
the memory model (actually at the end of N1744) appears to be relevant for
shared memory processing, but not necessarily for segmented memory
processing.

Final remarks on N1745
~~~~~~~~~~~~~~~~~~~~~~

* An argument can be made to disallow the VOLATILE attribute for non-scalar
  coarray objects as well as coarray objects not of type default logical,
  default integer or default real. This would indeed prevent users from
  doing stupid things, and it would also obviate the need to disambiguate
  between the two kinds of VOLATILE semantics (atomic vs. non-atomic).
  If the atomic calls suggested by Donev are put into the standard, VOLATILE
  coarrays could be disallowed completely.


3. Concerning Paper N1744:
==========================

Section 1 (Sequential Consistency):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The author seems to believe that in the absence of the user defining a
suitable synchronization sequence the processor should impose one. This
appears contrary to the spirit of coarray programming, which intends to
minimize the constraints as far as possible to achieve improved performance.

Suppose we have a coarray program with 4 images and 2 segments per image.
The first segment may be a bit load imbalanced (think irregular lattices)
and the second one is a CRITICAL block collecting things. Under the author's
suggestions 1 or 3 we might well get this (unit timesteps downward):

  Image       1   2   3   4
  -----------------------------
  Segment     1   1   1   1
              1   1   1
              1   1
              1
              2
                  2
                      2
                          2

while the imbalance would be very nicely hidden

  Image       1   2   3   4
  -----------------------------
  Segment     1   1   1   1
              1   1   1   2
              1   1   2
              1   2
              2

if we simply don't care. OK, I've given the worst case, but on the other
hand there were only very few images ...

So the answer with respect to sequential consistency is:

* The user must fix the segment ordering if his algorithm requires it, and
  should not do so if it doesn't. Tools for identifying race conditions are
  welcome.
* Sequential consistency may be important with respect to the algorithm,
  i.e., running the algorithm with one image only should, if supported,
  yield consistent results with a many-image run (typically within some
  specified precision, due e.g. to reordering changes in reductions).

Section 2 ("Data storage"):
~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is addressed in J3/08-290.

Section 3.1 ("User-defined ordering ...")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

According to 8.5.1, ordering (be it via image synchronization statements or
user-defined constructs using SYNC MEMORY) does imply consequential
ordering. Otherwise the statements on definability of coarrays in para 6 of
that section would not make any sense. For the programmer, this does mean
that, e.g., SYNC ALL should be used very carefully since this may transfer
many not-yet-needed outstanding buffers.

In particular, the example program is non-conforming. The SYNC MEMORY always
only refers to the local image (hence there is no N**2 effect), and
segment 2 on image 1 is in fact unordered with respect to segment 2 on all
other images. The fact that in many cases "correct" results will be printed
out does not disprove this. Hence, this code *cannot* serve as an example
for user-defined ordering.

Section 3.2 ("... Progress"):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I consider the example a bit of a red herring since deadlock situations with
companion processors or I/O may also occur if only two images are involved.

The author has, however, in my opinion, triggered a bug in the specification
that must be fixed in 8.5.1 para6. In the first bullet of para6, the
following situations are covered:

----------------------------------------------------------------
Legend: P, Q, ... are image numbers
        a is a coarray
        S(XY) is a pairwise sync which induces segment ordering
              for the two involved images.
        Time goes downward ... whatever that means.
----------------------------------------------------------------

                P               Q
                |               |
                |               |   a = ...
       S(PQ)    ~~~~~~~~~~~~~~~~~              RAW
                |               |
   ... = a[Q]   |<--------------|

and further diagrams covering WAW (push instead of pull by P), WAR.

What is not covered correctly, however, are (at least) the cases

                P               Q               R
                |               |               |
                |   ... = a[R]  |<--------------|
       S(PQ)    ~~~~~~~~~~~~~~~~~               |    WAR (R "passive")
                |               |               |
   a[R] = ...   |------------------------------>|
                |               |               |
       S(PR)    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                |     S(QR)     ~~~~~~~~~~~~~~~~~
                |               |               |
                |   ... = a[R]  |<--------------|    RAW (R "passive")
                |               |               |
                    (or WAW, R "passive", not shown)

Indeed it appears the present formulation of the draft standard requires
S(PQ) twice in the above diagram, making it look like a one-sided MPI call
with a passive partner (not implemented e.g., in MPICH2 or Intel MPI for
good reason). In my opinion, R should only be passive with respect to
references to a, but not with respect to requiring a sync images (/P,Q,R/)
(as drawn in the diagram) for the RAW case, as images P and Q do. However,
for WAR indeed only S(PQ) is needed.

So I think it is appropriate to introduce the concept of an image being
"owner" of a coarray and changing the first bullet to cover all conceivable
transactions with the owner (I may still have overlooked something).

Section 3.3 ("... Proposal"):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'd suggest replacing

  "The mechanisms that may be used to provide user-defined ordering are
   processor dependent."

by

  "Additional, processor dependent mechanisms may be used to provide
   user-defined ordering"

since VOLATILE coarrays or atomic intrinsics are already available for that
purpose.
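
To illustrate user-defined ordering through atomic intrinsics, the spin
loop of Note 8.38 might be written roughly as follows. This is only a
sketch under stated assumptions: it uses the ATOMIC_DEFINE and ATOMIC_REF
intrinsics and the ATOMIC_LOGICAL_KIND constant in the form later published
in Fortran 2008, which is not necessarily the interface proposed in N1753.
The program name and the image numbers p and q are hypothetical.

```fortran
program spin_wait_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_logical_kind
  implicit none

  ! Hypothetical image numbers: p releases, q waits.
  integer, parameter :: p = 1, q = 2

  ! Scalar coarray flag accessed only through atomic intrinsics,
  ! so no VOLATILE attribute is involved.
  logical(atomic_logical_kind) :: locked[*] = .true.
  logical :: val

  ! Assumption: with fewer than q images the handshake is skipped,
  ! so a single-image run does not spin forever.
  if (num_images() >= q) then
     if (this_image() == p) then
        ! ... preceding segment on p: work that q depends on ...
        sync memory                              ! ends the preceding segment
        call atomic_define(locked[q], .false.)   ! atomic write on image q
        sync memory
     else if (this_image() == q) then
        val = .true.
        do while (val)
           call atomic_ref(val, locked)          ! atomic read of the local flag
        end do
        sync memory                              ! starts the subsequent segment
        ! ... subsequent segment on q: may use results of p ...
     end if
  end if

  print '(a,i0,a)', 'image ', this_image(), ' passed the spin-wait section'
end program spin_wait_sketch
```

In this sketch the atomic calls only transfer the flag; it is the SYNC
MEMORY statements on p and q that delimit the segments and thus establish
the ordering between the preceding segment on p and the subsequent segment
on q. Only images p and q take part, and any other images pass straight
through.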