ISO/IEC JTC1/SC22/WG5-N1748

Clarifications of Coarray Memory Model
--------------------------------------

Nick Maclaren and Aleksandar Donev


This paper arises out of Email discussions, especially between the
authors, and covers all of the points that have been at issue over the
basic segment model.  It addresses exactly the same issues as N1744, but
attempts to provide alternative (and possibly more acceptable) wording.
It assumes that VOLATILE coarrays are excluded.


Proposed changes
----------------

1) There should be a new NOTE in 2.4.5, following paragraph 3, along the
following lines:

    "NOTE 2.11+
    The above rules, taken together, define what is meant by 'a
    sequence, in time' (2.4.1).  All that a conforming program may
    assume is that actions take place in the statement that performs
    them (except when explicitly stated otherwise), executed statements
    are totally ordered within a segment, the segments executed by a
    single image are totally ordered, and the segments executed by
    separate images are partially ordered by image control statements
    (8.5.1)."

This is not the happiest way of saying this, but the problem is that the
existing wording is very serial, and uses terms like "a sequence, in
time".  Serial time is a well-defined concept, but parallel time is not.

It is necessary to mention that actions take place logically within a
statement, as many established parallel models differ - as does Fortran
asynchronous I/O.  Fortran's abstract model may be the original but,
regrettably, most modern computer scientists speak a different language,
and that is what most implementors and programmers will have been
taught.


2) There should be a new NOTE in 2.5.7, along the following lines:

    "NOTE 2.18+
    Accessing a coarray on its own image by using a set of cosubscripts
    that map to that image has exactly the same effect as accessing it
    without cosubscripts.  In particular, the segment ordering rules
    (8.5.1) apply whether or not it uses cosubscripts to access the
    coarray."
Multiple experienced readers have already had difficulty with the
current wording (see J3/08-126 - the first author had the same problem),
and this clarifies the intent.


3) 8.5.1 paragraph 5 should be changed to start:

    "By execution of image control statements, optionally combined with
    user-defined ordering (8.5.4), ..."

Currently, the normative text is slightly contradictory over whether a
SYNC IMAGES statement can be used instead of a SYNC MEMORY one for
user-defined ordering with another image that is not in the SYNC IMAGES
set; currently, 8.5.1 paragraph 5 and 8.5.4 paragraph 1 imply that it
cannot be, and 8.5.4 paragraph 2 implies that it can be.

This explicitly permits code like the following:

    Image 1:
        ! segment P
        SYNC IMAGES ( (/ 2 /) )
        CALL UNLOCK("whatever")

    Image 9:
        CALL LOCK("whatever")
        SYNC MEMORY
        ! segment Q

The SYNC IMAGES has the same effect as a SYNC MEMORY in ensuring that
segment P precedes segment Q.


4) There should be two new 8.5.4 paragraphs, 3 and 4, along the
following lines:

    "User-defined ordering of segment Pi on image P to precede segment
    Qj on image Q takes the following form:

        Image P executes an image control statement which ends segment
        Pi, and then executes a statement that performs a
        synchronisation Zij between images P and Q.

        Image Q executes a statement that performs the same
        synchronisation Zij, and then executes an image control
        statement which starts segment Qj.

    The mechanisms that may be used for synchronisation are processor
    dependent."

This specifies precisely what user-defined ordering means in terms of
statements, and that the supported mechanisms for the latter are
processor dependent.  Because there are several reasonable
interpretations compatible with the current words, this needs to be
normative.
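
As an illustration only, the form described in proposal (4) might be
sketched as follows.  The sketch assumes that the synchronisation Zij is
carried by the atomic subroutines ATOMIC_DEFINE and ATOMIC_REF of the
published Fortran 2008 standard; this paper does not discuss them, and
whether a given processor accepts them as a user-defined ordering
mechanism is processor dependent.  The names "ready", "seen" and
"payload", and the choice of images 1 and 2 as P and Q, are hypothetical.

```fortran
PROGRAM User_Ordering
    USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY : ATOMIC_LOGICAL_KIND
    IMPLICIT NONE
    ! Assumption for illustration: images 1 and 2 play the roles of P and Q.
    INTEGER, PARAMETER :: p = 1, q = 2
    LOGICAL(ATOMIC_LOGICAL_KIND) :: ready[*] = .FALSE.   ! carries Zij
    LOGICAL(ATOMIC_LOGICAL_KIND) :: seen
    INTEGER :: payload[*] = 0

    IF (NUM_IMAGES() >= 2) THEN
        IF (THIS_IMAGE() == p) THEN
            payload[q] = 42                         ! segment Pi
            SYNC MEMORY                             ! ends segment Pi
            CALL ATOMIC_DEFINE (ready[q], .TRUE.)   ! synchronisation Zij
        ELSE IF (THIS_IMAGE() == q) THEN
            seen = .FALSE.
            DO WHILE (.NOT. seen)                   ! synchronisation Zij
                CALL ATOMIC_REF (seen, ready)
            END DO
            SYNC MEMORY                             ! starts segment Qj
            PRINT *, payload                        ! segment Qj
        END IF
    END IF
END PROGRAM User_Ordering
```

Image P ends segment Pi with SYNC MEMORY before signalling, and image Q
begins segment Qj with SYNC MEMORY after observing the signal, so the
definition of payload in Pi is ordered before its reference in Qj.
Whether the spin loop on image Q terminates promptly on a particular
processor touches on the question of progress raised in (6).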


5) There should be a new NOTE in 8.5.4, along the following lines:

    "NOTE 8.37+
    A processor should include at least the following as potential
    user-defined ordering mechanisms:

        Closing a unit connected to an external file, followed by
        opening a unit connected to an external file.

        Writing to a unit connected to an external file, followed by
        executing the FLUSH statement for it, followed by reading from
        a unit connected to an external file.

        Two calls to impure subroutines that are provided by a
        companion processor."

This clarifies that an implementor should attempt to support at least
the mechanisms most widely used by current parallel codes.  As N1744
mentions, there is an example of the last in a NOTE.


6) While discussing paper N1744 via Email, the first author realised
that he had made a mistake in including the issue of 'progress' together
with user-defined ordering.  While the combination is by far the most
likely to cause deadlock in practice, and is resolved by proposal (4)
above, the issue can arise even with SYNC IMAGES.

The question is whether images P and Q can communicate through a coarray
on image R, irrespective of what R is doing at the time.  This is
extremely hard to implement on some systems, at least when R is in a
call to a companion processor, performing I/O or in a long-running
'pure' CPU loop.  Of the widespread parallel interfaces, only MPI has
specified exactly when progress is required in normative text.  For
example:

    PROGRAM Progress
        INTEGER :: one[*] = 0
        SELECT CASE (THIS_IMAGE())
        CASE(1)
            one[9] = 123+one[8]
            SYNC IMAGES ( (/ 2 /) )
        CASE(2)
            SYNC IMAGES ( (/ 1 /) )
            PRINT *, one[9]
        CASE(8)
            one[2] = 456+one[1]
            SYNC IMAGES ( (/ 9 /) )
        CASE(9)
            SYNC IMAGES ( (/ 8 /) )
            PRINT *, one[1]
        END SELECT
    END PROGRAM Progress

Consider a processor where an image services requests for coarray data
that it owns only when it reaches an image control statement; this is
common for MPI, and is also done by the reference implementation of UPC.
The above program will deadlock, because image 1 will not reach its SYNC
IMAGES until after images 8 and 9 have responded, and image 8 will not
reach its SYNC IMAGES until after images 1 and 2 have responded.

Obviously, that is a poor implementation of coarrays, but that is not
the point at issue.  The question is whether it is a conforming
processor in the sense of 1.4 paragraph 2.

At the very least, there should be a NOTE in 8.5.1, along the following
lines:

    "NOTE 8.32+
    Where segment Pi accesses a coindexed object on image Q, but image Q
    is executing certain statements, the access may be delayed until
    image Q is free to service the access; this can cause deadlock in
    some programs.  However, processors should attempt to complete the
    access as soon as possible, irrespective of what statement image Q
    is executing at the time, and avoid such deadlock."

It would be better to have normative text that requires progress to be
made, at least when image Q next reaches an image control statement, and
probably even while it is in a 'pure CPU loop', but that is considerably
harder to do.
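
The 'pure CPU loop' case can be isolated in a smaller sketch.  The
following is an illustration only: the program name, the variable names
and the loop body are hypothetical, and images 1, 2 and 3 are assumed to
play the roles of P, Q and R.

```fortran
PROGRAM Relay
    IMPLICIT NONE
    INTEGER :: box[*] = 0
    INTEGER :: i
    REAL :: acc

    IF (NUM_IMAGES() >= 3) THEN
        SELECT CASE (THIS_IMAGE())
        CASE (1)                          ! image P
            box[3] = 42                   ! defines the coarray on image R
            SYNC IMAGES ( (/ 2 /) )
        CASE (2)                          ! image Q
            SYNC IMAGES ( (/ 1 /) )
            PRINT *, box[3]               ! references the coarray on image R
        CASE (3)                          ! image R
            ! Long-running loop containing no image control statements.
            acc = 0.0
            DO i = 1, 100000000
                acc = acc + SIN(REAL(i))
            END DO
            PRINT *, acc
        END SELECT
    END IF
END PROGRAM Relay
```

Image 3 never accesses box itself, and the two accesses to box[3] are
ordered by the SYNC IMAGES statements on images 1 and 2.  On a processor
of the kind described above, those accesses might nevertheless not
complete until image 3 leaves its loop; this program is merely delayed
rather than deadlocked, but it shows what a requirement for progress
within a 'pure CPU loop' would have to cover.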