ISO/IEC JTC1/SC22/WG5-N1744 Coarrays and Memory Models -------------------------- Nick Maclaren on behalf of the BSI Fortran panel, 7th October 2008. This paper reflects the views of the majority of the panel. 0. Introduction --------------- This paper addresses some points where we feel that the standard is ambiguous or confusing about the precise semantics of coarrays, mostly as far as non-VOLATILE objects are concerned; it does not propose any changes to what seem to be the agreed semantics. Most of this paper is written assuming that VOLATILE coarrays have been excluded. If they were to be included, there are some very tricky questions about how they should interact with segments, and the SYNC MEMORY statement in particular. Issues with VOLATILE are discussed separately in the paper N1745 "VOLATILE Coarrays", but their basic memory model aspects are described at the end of this one. This paper takes the viewpoint of a programmer or implementor reading the Fortran standard, who has not been involved in any of the coarray standardisation discussions. The issues are not about what the standard intends, but about whether its words (and especially the normative text) explicitly and correctly state its intent. Unfortunately, there is no consensus on precisely what memory models are appropriate for shared memory parallelism, either theoretically or in practice, and experience with supporting scientific programmers is that many of their 'hard' difficulties derive from them making reasonable assumptions that do not happen to be true. None of the issues raised are purely theoretical, as the author has seen all of them cause major confusion in reasonable code using similar interfaces, often written by experienced users. Section 2.3 of paper J3/08-126, by Mellor-Cummey, Adhianto and Scherer also says that a precise memory model is a key feature, and Fortran currently lacks it. There are 4 proposals in this paper, 2 for normative text and 2 for new NOTEs, in sections 1.1, 2 and 3.3. 1. Sequential Consistency ------------------------- Section 8.5.1 specifies how segments are ordered, but does not say anything about ordered segments where the precise order is not specified by the standard, implicitly or explicitly. This issue has been discussed extensively by theoreticians and practitioners in parallel programming, but with little consensus. Consider a program like the following: PROGRAM Sequential_Consistency INTERFACE ! The initialisation of mutexes to an unlocked state is omitted ! for clarity; the companion processor is assumed to create and ! initialise at least mutexes indexed by arguments 8 and 9. ! Otherwise, these interfaces are modelled on the POSIX calls ! pthread_mutex_lock and pthread_mutex_unlock. SUBROUTINE Mutex_lock (which) BIND(C) USE, INTRINSIC :: ISO_C_BINDING INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which END SUBROUTINE Mutex_lock SUBROUTINE Mutex_unlock (which) BIND(C) USE, INTRINSIC :: ISO_C_BINDING INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which END SUBROUTINE Mutex_unlock END INTERFACE INTEGER :: one[*] = 0, two[*] = 0, z(8:9) = 0, i SELECT CASE(THIS_IMAGE()) CASE(1) CALL Mutex_lock(8) SYNC MEMORY one[8] = 123 SYNC MEMORY CALL Mutex_unlock(8) CASE(2) CALL Mutex_lock(9) SYNC MEMORY two[9] = 456 SYNC MEMORY CALL Mutex_unlock(9) CASE(3) CALL Mutex_lock(8) SYNC MEMORY z(8) = one[8] SYNC MEMORY CALL Mutex_unlock(8) CALL Mutex_lock(9) SYNC MEMORY z(9) = two[9] SYNC MEMORY CALL Mutex_unlock(9) WRITE (3,*) z(8), z(9) CASE(4) CALL Mutex_lock(9) SYNC MEMORY z(9) = two[9] SYNC MEMORY CALL Mutex_unlock(9) CALL Mutex_lock(8) SYNC MEMORY z(8) = one[8] SYNC MEMORY CALL Mutex_unlock(8) WRITE (4,*) z(8), z(9) END SELECT END PROGRAM Sequential_Consistency This can write any of the following pairs of values to units 3 and 4: '0 0', '123 0', '0 456' and '123 456'. Most combinations of these pairs are possible, even with sequential consistency [Lamport] (see below). A) The principal question is whether unit 3 can contain '123 0' and unit 4 can contain '0 456'; if this is so, then the execution is not sequentially consistent. B) If it is possible, then the subsidiary question is exactly what Fortran requires in terms of consistency between segments. C) If it is not possible, then the subsidiary question is what normative text states that it is not possible. There are many ways of phrasing this question, but one of the simpler is that it asks whether image execution control is required to apply to memory banks (i.e. access to storage across all images) as well as image execution. That is Lamport's second restriction (see below), and does not appear to be stated in Fortran, even indirectly. 1.1 Proposal ------------ There are three possible options: one is to specify sequential consistency in the currently unspecified case, another is to specify exactly what Fortran requires from the processor and guarantees to the programmer, and the third is to exclude the problematic case. Option 1 (sequential consistency): ---------------------------------- Because sequential consistency is the best understood parallel memory model, the one that is generally most favoured for shared memory designs, and one that is already required by the specification of SYNC ALL and SYNC IMAGES, Fortran could require it for all segments. If this is adopted, normative text along the following lines should be added to section 8.5.1 paragraph 5: Where multiple segments are ordered, but they can be executed in more than one complete order, the processor shall execute them in a sequentially consistent fashion. The standard could also provide a reference to Lamport's letter (see below), if the term 'sequential consistency' were felt to be unclear on its own. Option 2 (not sequential consistency): -------------------------------------- There is no consensus among parallelism experts about what alternative shared memory models are reasonable, and no standard terminology to define them. Fortran would therefore have to specify exactly what its specified behaviour is and when programmers can rely on it. Regrettably, none of the BSI panel have been able to make a proposal that they are sure is self-consistent enough for standardisation, but the matter is being considered. The minimum requirement is that there is an identifiable subset of uses that has specified behaviour, and that the standard clearly specifies the subset and its behaviour. Option 3 (removal of the difficult case): ----------------------------------------- SYNC MEMORY could be removed, leaving only SYNC ALL and SYNC IMAGES. The execution of a CRITICAL construct on different images would be explicitly stated to be completely ordered. NOTE 8.31 can be read to imply that this issue also arises with the use of CRITICAL and SYNC IMAGES. At best, that note is unhelpful, and may increase confusion; unless there is clear normative text specifying the model (or equivalent), it needs deletion or rewriting. 1.2 Reference ------------- Lamport gave a definition of sequential consistency in a letter in IEEE Transactions on Computers 28 pp 690-691, 1979, where he defined it as: The results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. It is important to note that that definition, and most other work on shared-memory parallelism, assume an essentially flat memory model, but coarrays use a segmented one. Lamport considered the matter of 'memory modules' in his letter, and suggested the following restrictions in order to ensure sequential consistency: Each processor issues memory requests in the order specified by its program. Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue. The coarrays of an image can be considered as a memory module in this context, and it is fairly clear that the first restriction is imposed, but there seems to be nothing in the current draft that imposes the second restriction. Without that restriction, the coarray support is not sequentially consistent. An alternative way of viewing this issue is that sequential consistency requires that both that the execution on a single image is sequential, and that access to the data owned by a single image is. The current coarray specification requires the former, but does not seem to require the latter. 2. Data Storage --------------- Section 2.5.7 describes coarrays, and paragraph 4 allows an image to access a coarray owned by itself without using cosubscripts. The problem here is that Fortran does not explicitly state that the square bracket notation is merely syntactic sugar, with no special semantics. While cautious readers of the standard will not assume anything that is not explicitly promised, experience with similar parallel language extensions shows that at least some readers will misunderstand. This is made more confusing by using the term "access" in 2.5.7 paragraph 4; that is not a defined term, and is not widely used elsewhere in the standard in normative text. Because it is not used entirely consistently, the solution of defining it would be more trouble than it is worth. This is a relatively minor point if volatile coarrays are excluded, but a more serious one if they were to be included. This issue should be clarified in a note, such as: NOTE 2.18+ Accessing a coarray on its own image by using a set of cosubscripts that map to the image has exactly the same effect as accessing it without cosubscripts. In particular, the segment ordering rules (8.5.1) apply whether or not it uses cosubscripts to access the coarray. The authors of paper J3/08-126 also did not understand the intent, and asked this question in Section 2.3. 3. User-Defined Ordering ------------------------ Section 8.5.1 paragraph 5 permits 'user-defined ordering' to control whether two segments are ordered, but provides no further specification of what that means. Experience with supporting the users of parallel systems shows that misunderstandings of what ordering means are very common, and both of the issues described here arise in practice. Fortunately, the problems are very easy to resolve in Fortran, merely by stating that the permissible forms of user-defined ordering are processor dependent. That will not eliminate the problems, but will make the situation clear to all Fortran experts, whether or not they are parallel programming experts. 3.1 User-Defined Ordering and Separation in Time ------------------------------------------------ Many users will expect that temporal ordering (i.e. segment A precedes segment B in time) will imply consequential ordering (i.e. that the effects of A will be visible to B). However, that is a very onerous constraint to place on implementors, and few currently honour it for any parallel language. The chances of confusion are enhanced in Fortran, because section 2.4.1 paragraph 2 says 'Image execution is a sequence, in time, of actions', and the word 'time' can reasonably be read to mean physical time. Consider a large, distributed memory cluster with a consistent global clock. There are known, practical algorithms for providing such consistency with a proven accuracy of better than 4.T.log_2(N), where N is the number of images and T is the one-way latency for small messages. A user might reasonably assume that, if each segment is separated in time by more than twice the accuracy of the clock, then purely temporal ordering is adequate as a form of 'user-defined ordering'; there is nothing in the standard that even implies that it is not. While parallelism experts will know that is not intended, experience shows that many readers of the standard will not draw that conclusion, and there is nothing in the words that states they are wrong to do so. An example of that sort of coding is shown in the following program: PROGRAM User_defined_ordering INTERFACE ! This assumes that the subroutine TIMESTAMP returns the value ! of the global clock in seconds from a suitable starting point, ! and the global clock is consistent to better than 0.5 seconds. SUBROUTINE Timestamp (value) BIND(C) INTEGER, INTENT(OUT) :: value END SUBROUTINE Timestamp END INTERFACE INTEGER :: var[*] = 0, z DO CALL Timestamp(z) IF (z > 3*THIS_IMAGE()) EXIT END DO SYNC MEMORY ! The system and number of images are assumed to be such ! that this segment takes considerably less than 1 second ! to complete. DO n = 1,NUM_IMAGES() var[n] = THIS_IMAGE() END DO SYNC MEMORY CALL Timestamp(z) WRITE (*,*) THIS_IMAGE(), z END PROGRAM User_defined_ordering [ The function Timestamp could also be implemented as one that combines the values returned from the intrinsic DATE_AND_TIME into a single real or integer numeric number of seconds. It is described here in terms of a C function only because there are POSIX functions that provide exactly the semantics it assumes. ] Unfortunately, supporting this program requires the processor to perform a pairwise synchronisation around each of the SYNC MEMORY statements (within the resolution of the clock) with all of the other images, and not to proceed until the other images have acknowledged the synchronisation. That requires N^2 handshakes for the execution of that program, where N is the number of images. No parallelism expert would ever write such code, but the above conditions can reasonably be regarded as amounting to 'user-defined ordering'; the current wording of the standard implies that they do. Even quite experienced users will say that the output (i.e. the timestamps following the completion of each segment) demonstrates that the segments are ordered. Assuming that this is not intended, the fact that they do not should be clarified, assuming that is what is agreed. We make a suggestion in section 3.3. 3.2 User-defined Ordering and Progress -------------------------------------- There is another ambiguity in the current wording - that of 'progress'. The basic requirement is to ensure that a correct program will eventually terminate, and that a processor will not introduce artificial deadlock or indefinite livelock. The converse of that is to impose enough constraints on correct programs so that it is always possible. [ "Artificial deadlock" is used to refer to the case where there is no deadlock in the logic of the program, but the processor's implementation is such that the program deadlocks. "Indefinite livelock" is where a program gets into a situation where the processor's implementation causes it to get stuck in an indefinite loop with images waiting for each other. ] This is notoriously difficult to specify, and the best description seems to be in the MPI Report, especially MPI-1 4.12 and MPI-2 11.7.2. MPI imposes enough restrictions on programs that an implementation can always make progress, and explicitly requires implementations to make progress for all valid programs. Fortran coarrays are comparable to MPI-1 in the absence of user-defined ordering, but that introduces problems of the sort that are described in MPI-2 11.7.2, where MPI does not require progress. There is nothing in the wording of coarrays that makes it clear that a processor must make progress for all conforming programs. It is likely that some implementors will take the approach that they need do so only for all 'reasonable' programs, and each implementor will use a different interpretation of the word 'reasonable'. There is more on this issue in the paper N1745 "VOLATILE Coarrays", but this paper is intended to be comprehensible on its own. The following is an example of how the problem can arise (the mutex interface is identical to that used in the previous example): PROGRAM Progress INTERFACE ! The initialisation of mutexes to an unlocked state is omitted ! for clarity; the companion processor is assumed to create and ! initialise at least mutexes indexed by arguments 8 and 9. ! Otherwise, these interfaces are modelled on the POSIX calls ! pthread_mutex_lock and pthread_mutex_unlock. SUBROUTINE Mutex_lock (which) BIND(C) USE, INTRINSIC :: ISO_C_BINDING INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which END SUBROUTINE Mutex_lock SUBROUTINE Mutex_unlock (which) BIND(C) USE, INTRINSIC :: ISO_C_BINDING INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which END SUBROUTINE Mutex_unlock END INTERFACE INTEGER :: value[*] = 0 SELECT CASE(THIS_IMAGE()) CASE(1) CALL Mutex_lock(8) CASE(2) CALL Mutex_lock(9) END SELECT SYNC ALL SELECT CASE(THIS_IMAGE()) CASE(1) value[3] = 123 SYNC MEMORY CALL Mutex_unlock(8) CASE(2) CALL Mutex_lock(8) SYNC MEMORY PRINT *, value[3] CALL Mutex_unlock(9) CASE(3) CALL Mutex_lock(9) END SELECT END PROGRAM Progress Images 1 and 2 need to transfer a non_VOLATILE value that is owned by image 3, but image 3 is in a companion processor call that will not return until after images 1 and 2 have transferred that data! The question is whether the program is correct and, if not, where in the Fortran standard that is stated. Note that this is different from the existing, serial case where a program goes into an infinite loop, because there is no deadlock in the logic of the above program, only in the processor's implementation. 3.3 Proposal ------------ The mechanisms that can be used to provide user-defined ordering should be stated to be processor dependent, which would make it clear that users cannot assume a particular behaviour without checking the appropriate documentation. Normative text along the following lines should be added to section 8.5.1, somewhere around paragraph 5: The mechanisms that may be used to provide user-defined ordering are processor dependent. It would also be desirable to encourage implementors to provide the mechanisms that are used by current parallel programs for ordering (primarily I/O to external files and using synchronisation objects, like semaphores). A note like the following would help, though it is clearly not critical. NOTE 8.29+ A processor should include at least the following as potential user-defined ordering mechanisms: Closing a unit connected to an external file, followed by opening a unit connected to an external file. Writing to a unit connected to an external file, followed by executing the FLUSH statement for it, followed by reading from a unit connected to an external file. Two calls to impure subroutines that are provided by a companion processor. Note that the fact that images have separate units is irrelevant, as they could be connected to one another through a series of FIFOs. NOTE 8.39 already gives an example for the third mechanism. 4. VOLATILE Issues ------------------ Specifically VOLATILE issues are described in the paper N1745 "VOLATILE Coarrays", but some general memory model issues are described here, including raising the question of interactions between segment and VOLATILE coarray consistency. There is a little overlap between the papers, to avoid the need for cross-referencing. It is extremely unclear whether VOLATILE coarrays should use the same memory model as segments. That could have a major performance impact on large systems. However, if they do not use compatible, well-defined memory models, many people will get confused. Section 2.3 of paper J3/08-126 refers to this point. Realistic programs will mix the use of VOLATILE coarrays and partially-ordered segments in quite complicated ways, which confuses the issue further. primitives. 4.1 Basic Memory Model ---------------------- This is discussed in more detail in the paper N1745 "VOLATILE Coarrays", but a simplified example is given here: PROGRAM Memory_Model_1 INTEGER, VOLATILE :: one[*] = 0, two[*] = 0 INTEGER :: p, q SELECT CASE(THIS_IMAGE()) CASE(1) one[8] = 123 CASE(2) two[9] = 456 CASE(3) p = one[8] q = two[9] WRITE (3,*) p, q CASE(4) q = two[9] p = one[8] WRITE (4,*) p, q END SELECT END PROGRAM Memory_Model_1 The question is whether unit 3 can contain '123 0' and unit 4 can contain '0 456'. There seems to be no relevant normative text making this clear. If we refer back to program Sequential_Consistency, we are asking the same question for segments. The question is whether VOLATILE coarray accesses have the same consistency requirements as ordered segments with an unspecified ordering. If they do not, many users will get very confused. But, if they do, it means that each access to a VOLATILE coarray has to be implemented very much as if it were a complete segment. Note that this example is the simplest that shows the issue; more complex, but still realistic, examples are available from the author. 4.2 Summary ----------- SC22WG21 (C++) has spent a considerable amount of time debating the consequences of these issues, and defining a precise memory model. Fortran would need to do the same if VOLATILE coarrays were to be included. See, for example, SC22WG21 paper N2556 and the numerous papers it references: http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2556.html This aspect should be considered when discussing the paper N1745 "VOLATILE Coarrays".