ISO/IEC JTC1/SC22/WG5 N2039

Response to the WG5 straw ballot on N2033

Bill Long and John Reid

This paper contains responses to the comments in the WG5 straw ballot on N2033 (see N2038) and a set of edits to N2033.

Reinhold Bader (A) wrote

"Now that the TS has the concept of stalled images, I think that image control statements without a STAT= specification that involve a failed image could now relatively easily be made to result in the executing image becoming stalled, instead of terminating the program. This would make development of fail-safe packages much easier, because the fail-safety can be designed in a top-down manner i.e. library code that synchronizes, allocates or deallocates must not necessarily be modified."

Response

We see the inclusion of a STAT= specification in an image control statement as desirable and believe that this will not have a big effect on performance. This is in contrast to protecting against remote references on failed images, for which there is no syntax and which would have performance overheads. Furthermore, we now have a fairly simple model of image control statements without a STAT= specifier - if successful they form segment boundaries and can be used for segment ordering and hence data consistency, or they fail and the program aborts, making data consistency issues irrelevant. Therefore, we do not favor the extension of stalling to image control statements without a STAT= specification.

.......

Reinhold Bader (B) wrote

(B) Section 7. It is not specified what happens if no STAT argument is specified.

Response

We suggest this edit (copied from [18:24-25])

[17:32+] Add new para "If a condition occurs that would assign a nonzero value to a STAT argument but the STAT argument is not present, error termination is initiated."

.......

Reinhold Bader (D) wrote

(D) Collective intrinsics CO_BROADCAST (7.4.10) and CO_REDUCE (7.4.13) There still seems to be some missing text with respect to invoking these intrinsics with objects of derived type that have POINTER components.

Response

He is concerned about a pointer appearing to be associated with a target on a remote image. This can already happen in Fortran 2008 and such a pointer is regarded as undefined, see 16.5.2.5 of the Standard: "The association status of a pointer becomes undefined when ... (2) the pointer is pointer-assigned to a target on a different image,"

No edits to N2033 are needed.

.......

Reinhold Bader (C) and (E) are relevant only if the change proposed in (A) is accepted. Thus, there are no separate responses for these comments.

.......

Tobias Burnus wrote

The DTS does not address finalization of CO_BROADCAST and CO_REDUCE for derived types which have finalizers.

For CO_BROADCAST, simply adding a statement like the following should be sufficient and implementation wise, it should be simple as one can simply finalize it before the actual data transfer: In the description of "A" append: "On all images of the current team but on the image specified by SOURCE_IMAGE, A is finalized before it becomes defined."

For CO_REDUCE, the implementation will be more difficult; still, I believe it makes sense to require finalization.

Response

For CO_BROADCAST, we already have the words [22:24] "A becomes defined, as if by intrinsic assignment, ..." and the standard states in 4.5.6.3, para 9: "When an intrinsic assignment statement is executed, the variable is finalized after evaluation of expr and before the definition of the variable." Therefore, no change is needed.

For CO_REDUCE, these edits are needed:

[24:22] After "to A" add "as if by intrinsic assignment".

[24:23] After "to A" add "as if by intrinsic assignment".

.......

Malcolm Cohen wrote

I agree with Robert Corbett's vote.

I am somewhat taken aback that we've suddenly added this new concept (stalled images) with far-reaching effects (and more proposed in other comments) at the last minute.

It needs to be clear that it is possible to implement the "reliability" (failed/stalled/whatever image) features efficiently on a variety of architectures. It should not require incompatible changes to an existing coarray implementation (which the current draft certainly seems to do). I have no problem with some "bells and whistles" potentially requiring extra work, but a reasonably effective subset needs to be workable without heroic efforts, and without affecting programs that do not use the feature.

Response

See our responses below to Robert Corbett's vote.

.......

Malcolm Cohen wrote

Re finalization, I agree with Tobias Burnus' comments that it would be good for this to be spelled out in detail for CO_BROADCAST and CO_REDUCE. For the latter it should say that the result of applying the function is finalized, including the final function application, (the latter is as if the output variable were assigned an expression that is the last function reference). It should, I think, also be stated that the finalizations of the intermediate function results are done on the image that actually invoked the function, so that any deallocations are handled by the image that did the allocations.

Response

See our responses below to Tobias Burnus' comments.

.......

Robert Corbett wrote

My primary objection is to the requirements given in the third paragraph of Clause 5.9 [14:19-23]. I do not see how the specified semantics can be implemented without compromising the performance of codes that do not make use of the feature. I am not certain that the semantics can be implemented at all in some common environments.

Response

The intention was that implementations be permitted not to support transfer of control to the END TEAM statement, but the present wording does not say this.

We think these edits are needed to address this comment and your other comments:

[14:9] Change "detect that an image has stalled" to "manage a stalled image".

[14:19-25] Replace these two paragraphs by the following two paragraphs:

"If an image, in a statement other than an image control statement or an invocation of a collective or atomic subroutine, attempts to reference or define data using an that identifies an image that has failed, the executing image becomes a stalled image. If the identifies the initial team or the processor does not have the ability to manage a stalled image, the executing image remains a stalled image for the rest of the execution of the program. Otherwise, the executing image resumes execution at the END TEAM statement of the construct after execution of all finalizations and deallocations that would have occurred during the normal completion of active procedures that were invoked within the CHANGE TEAM construct. While an image is stalled, other images can still access data on that image. If an image is stalled in the initial team, it participates in normal termination as if it had initiated normal termination."

.......

Robert Corbett wrote

Clause 3.7 [5:41] What does it mean for an image to have "encountered" an ? I know we use the usual meaning of a word when we do not specify its meaning, but that rule is inadequate for this case. For example, if an image executes a statement that contains an , but that is part of an operand that is not evaluated, has the been "encountered?" My guess is that it has not, but I cannot tell that from the draft TS.

Response

Your guess is correct. Together with the rewrite of [14:19-25] in our response to your first comment, these edits are needed:

[5:41] Change "has encountered" to ", in a statement other than an image control statement or an invocation of a collective or atomic subroutine, attempts to reference or define data using".

[32:14] Change "has encountered" to ", in a statement other than an image control statement or an invocation of a collective or atomic subroutine, attempts to reference or define data using".

.......

Robert Corbett wrote

Clause 5.9, paragraph 3 [14:20-23] When does a stalled image transfer control to the END TEAM statement? Can it happen immediately or must it wait until all other images that are part of the same team have completed, failed, or stalled?

Response

In order to preserve symmetric memory, it would be necessary for the stalled image to participate in coarray deallocations. Also, it is intended that data on the stalled image remain available to executing images. The edits needed are included in the rewrite of [14:19-25] in our response to your first comment.

.......

Robert Corbett wrote

Clause 5.9, paragraph 3 [14:20-23] Are the deallocations and finalizations subject to any requirements w.r.t. the order in which they are performed? For example, during normal execution, allocatable objects that are part of an instance of an internal procedure will be deallocated before the allocatable objects that are part of the related instance of the host procedure. Is there any requirement that that ordering be respected by the stalled image?

Response

Yes, the order should be respected. The edits needed are included in the rewrite of [14:19-25] in our response to your first comment.

.......

Bill Long wrote

I recommend the following change.

N2033: [14:23] Delete ',without synchronization of coarray deallocations'.

Response

This edit has been included in the response to Robert Corbett.

.......

Nick Maclaren wrote

I agree with Robert Corbett and Malcolm Cohen about stalled images, but believe that they have understated the issue. The requirement is to handle the 'knock-on' effect of image A failing, image B getting stuck as a consequence, and image C then needing to interact with image C.

I agree with the authors that the concept is essential if support is to be provided for failed images, and that is one of the reasons that I have consistently voted against the whole feature or abstained.

Response

I think you mean "... image C then needing to interact with image B." As part of our response to Robert Corbett, we have added this sentence "While an image is stalled, other images can still access data on that image." An implementation can treat a stalled image as being very like an image that has initiated normal termination.

.......

Nick Maclaren wrote

I have implemented error recovery in run-time systems, have used and worked on it in several contexts, and know that I am not smart enough to specify it for a language like Fortran. Of the thousand or so language and environment specifications I have seen, I have never seen one specify this successfully, even for a single environment. It might be possible in Haskell, but Fortran is not Haskell.

From the lack of convergence of these documents and the comments on the mailing list, this TS seems to be failing in the ways that so many others have failed before it. It is doubtful that adding this facility takes "full account of the state of the art" (see the ISO Directives).

Response

Perhaps you were aiming for too perfect a system. It is not intended that all possible failure scenarios be covered. For example, NOTE 5.9 explains that it might be impossible to recover from failure of image 1.

.......

Nick Maclaren wrote

I believe that there is no chance whatsoever that this issue can be resolved, and WG5 still keep to the schedule agreed in Las Vegas (see N2020 and N2024). Indeed, I doubt that it could be done with even a year's delay. Solving this problem is not within the state of the art, despite considerable efforts in a great many contexts over the past half-century.

I believe that the whole feature of support for failing and stalled images should be removed, possibly specified in another TS, and not integrated until there is significant implementation and user experience in a fairly wide variety of environments.

Response

We would like to remind you that support of failed images is not required of the processor. It is our belief that agreeing to remove the feature would lead to several "no" votes and that we have to "agree to disagree".

.......

Nick Maclaren wrote

Many or most of the comments in N2013 on events have still not been addressed, nor have some of ones on atomics and collectives. In particular, there are assumptions of cross-facility coherence and progress but no normative text requiring them - indeed, quite the opposite. It is doubtful that the current TS is "consistent, clear and accurate" (see the ISO Directives).

This is extremely serious, as adopting an inconsistent set of assumptions will make it almost impossible to deliver the target specified in 1. Scope, paragraph 2, even ignoring the problem of the schedule.

I do not believe that this issue is as intractable as the previous one, because specifying data and control flow and progress are within "the state of the art". However, I am doubtful that the facilities in TS can be implemented efficiently without special hardware or operating system support, while still delivering the consistency and progress that seem to be assumed.

However, even if there are no consistency problems to be resolved, I do not believe that accepting these aspects of this TS is compatible with keeping to the agreed schedule. I believe that this area needs further clarity, even if not a polished specification, before the TS should be accepted.

I am not repeating the relevant comments in N2013, because there is little point - there has been little relevant change to the drafts.

Response

It would be very helpful to have explicit suggestions for edits.
These could then be considered for inclusion. We urge Nick Maclaren to provide suggested edits through his national body.

.......

Dan Nagle wrote

I recommend the following change.

[27:14-15] change "a nonzero value" to "a positive value"

Error values are positive.

Response

We agree with this edit.

.......

John Reid wrote

I recommend the following changes.

[10:19] Delete "or be the value of a team variable for the initial team".

Reason. Execution of FORM TEAM is always required.

Response

The error lies in the first part of this sentence: "The shall have been defined by execution of a FORM TEAM statement in the team that executes the CHANGE TEAM statement". It was intended that other means of defining the team variable, including the use of GET_TEAM, be permitted. We therefore suggest this edit:

[10:18] Change "The shall have been defined" to "The value of shall be the value of a team variable defined".

Further edits are needed to allow for the case where the team is the initial team:

[10:20] Change "those defined" to "those of team variables defined".

[10:21] Change "team." to "team or be the values of a team variable for the initial team."

[11:9] After "designator" add "and the current team is not the initial team".

[11:12] After "TEAM_ID." add "If TEAM_ID= appears in a coarray designator and the current team is the initial team, the value of is ignored."

.......

John Reid wrote

I recommend the following changes.

[10:38], [12:21], [29:34], [34:8], [34:13]. At the end of the sentence add "since execution last began in this team" (wavy underlined on page 34).

Reason. We need to allow for teams changing during the execution of the program. At the October meeting, these words were added at [13:5], [35:26], [35:37], and [36:1].

Response

We agree with these edits, except that the edit for [10:38] should be

[10:38] At the end of the sentence add "since execution last began in the team that was current before execution of the CHANGE TEAM statement"

and this edit is needed at [10:41]:

[10:41] At the end of the sentence add "since execution last began in the team that was current before execution of the corresponding CHANGE TEAM statement".

Malcolm Cohen later drew our attention to the fact that the concept of construct completion, used at [10:32] and in the new text for [14:19-25] in our first response to Robert Corbett, has not been defined. We therefore propose this further edit:

[10:23] At the end of the paragraph, add the sentence "A CHANGE TEAM construct completes execution by executing its END TEAM statement."

.......

John Reid wrote

I recommend the following changes.

[13:1] Change "the team" to "team".

Reason. Definite article is wrong here.

[13:5] Remove space before period.

Response

We agree with these edits.

.......

John Reid wrote

I recommend the following changes.

[14:9] Change "detect that an image has stalled" to "manage a stalled image".

[14:20] After "becomes a stalled image" add ". If the processor does not have the ability to manage a stalled image, the executing image becomes a stalled image for the rest of the execution of the program. If the processor has the ability to manage a stalled image, the executing image becomes a stalled image"

Reason. I think the intention is to allow implementations not to support stalled images transferring control to the END TEAM statement. Stalling will still happen and will need to be permanent.

Response

These edits have been superseded by the response to Robert Corbett.

.......

Van Snyder wrote

No, for similar reasons to Robert Corbett, Malcolm Cohen, and Nick Maclaren. I am concerned that we have added incompletely thought-through things at the last minute.

Response

See the responses to Robert Corbett, Malcolm Cohen, and Nick Maclaren.

.......

Stan Whitlock

Yes, but I recommend the changes in Bill Long and John Reid's ballots.

Response

See the responses to Bill Long and John Reid.