ISO/IEC JTC1/SC22/WG5 N2038

Result of the WG5 straw ballot on N2033

John Reid

N2035 asked this question:

Please answer the following question "Is N2033 ready for forwarding to SC22 as the DTS?" in one of these ways.

1) Yes.
2) Yes, but I recommend the following changes.
3) No, for the following reasons.
4) Abstain.

The numbers of answers in each category were:

1 for 1) Yes (Chen)
5 for 2) Yes, but I recommend the following changes (Bader, Long, Nagle, Reid, Whitlock)
4 for 3) No, for the following reasons (Cohen, Corbett, Snyder, Maclaren)
2 for 4) Abstain (Moene, Muxworthy)

The straw ballot has failed - consensus has not been reached. However, I believe that we may be able to reach consensus with far fewer changes than have been made at recent J3 meetings. Therefore, I request that the co-array email group, led by the Editor Bill Long, consider all the comments and prepare a fresh version by 31 December. I will then conduct a 14-day straw vote on the new version in January.

Here are the comments and reasons. I have included the comments of Tobias Burnus, who did not vote.

Reinhold Bader

2) Yes, but I recommend the following changes.

(A) Section 5.9

Now that the TS has the concept of stalled images, I think that image control statements without a STAT= specification that involve a failed image could now relatively easily be made to result in the executing image becoming stalled, instead of terminating the program. This would make development of fail-safe packages much easier, because the fail-safety can be designed in a top-down manner, i.e. library code that synchronizes, allocates or deallocates need not necessarily be modified.

Suggested edits to N2033:

[14:19-] Add a new paragraph "If an <> identifies an image that has failed and a corresponding team, the executing image becomes a stalled image." [[textually separate identification from consequences.
This is not only needed for image control statements, but also for atomic and collective invocations without a STAT.]]

[14:19] Replace "If an ... stalled image" by "If an image has stalled with respect to a team other than the initial team, it remains stalled"

[14:24] Replace "If an ... stalled image" by "If an image has stalled with respect to the initial team, it remains stalled"

[36:36+] Replace "or an image ... initiated." by "error termination is initiated. Otherwise, if an image involved in the execution of the statement has stalled or failed, the executing image becomes stalled. The stalled image's team is the current team if an END TEAM, FORM TEAM, SYNC ALL, SYNC MEMORY, or SYNC IMAGES statement is executed; it is the team specified by the value of the <> in the execution of a CHANGE TEAM or SYNC TEAM statement, and it is the team identified by the <> in execution of a LOCK, UNLOCK or EVENT POST statement."

[45:14-17] Replace para by "If the implementation is capable of managing stalled images, this example will continue execution in the face of failing images even if synchronization statements, collective or atomic subroutine invocations, or coarray allocations and deallocations inside the change team block do not specify a STAT argument."

(B) Section 7.2

It is not specified what happens if no STAT argument is specified. Suggested edit:

[17:32] Add new para "If no STAT argument is present in an invocation of an atomic subroutine and the coindexed argument is determined to be located on a failed image, the executing image becomes stalled; the team is that identified by the <>. Otherwise, if an error condition occurs, error termination is initiated."

(C) Section 7.3

Here, the case without a STAT argument needs modification.
[17:24-25] Replace para by "If no STAT argument is present in an invocation of a collective subroutine and a failed or stalled image is identified in the current team, the executing image becomes stalled with respect to that team, and the argument A becomes undefined; otherwise, if an error condition occurs, error termination is initiated."

(D) Collective intrinsics CO_BROADCAST (7.4.10) and CO_REDUCE (7.4.13)

There still seems to be some missing text with respect to invoking these intrinsics with objects of derived type that have POINTER components. Here are suggestions for edits:

[22:24] Before "A becomes defined", insert "Except for ultimate POINTER components, ".

[22:25] After "SOURCE_IMAGE.", add " The association status and value of any ultimate POINTER component of A is not changed."

[24:5] After "computed value" add ", except for ultimate POINTER components of A, "

[24:6] After "team.", add " The association status and value of any ultimate POINTER component of A is not changed."

[24:14] After "operation.", add " The implementation of OPERATOR shall not perform an ALLOCATE statement on any ultimate POINTER component of the function result."

(Tobias Burnus has requested that finalizers be executed for both the A argument and the OPERATOR function result. The above edits constitute an attempt to do without this, inasmuch as we're talking about INTENT(INOUT) arguments, while finalizers are normally only executed for INTENT(OUT). If his suggestion is followed instead, it should be noted that the finalizers must be PURE procedures, because the intrinsics are; allowing the A argument of CO_BROADCAST to be polymorphic would then also be precluded, because the PUREity of the actually executed finalizer could not be determined).
(E) Section 7.5.3 (MOVE_ALLOC)

Edit for support of stalling if executed without STAT argument:

[30:4-5] Replace para by "If no STAT argument is present in an invocation of MOVE_ALLOC, and a failed or stalled image is identified in the current team, the executing image becomes stalled with respect to that team; otherwise, if an error condition occurs, error termination is initiated."

_______________________________________________________________________

Tobias Burnus

First, thanks for the work on the draft. One item I want to raise now before I forget it or it is past 8 December:

The DTS does not address finalization of CO_BROADCAST and CO_REDUCE for derived types which have finalizers.

For CO_BROADCAST, simply adding a statement like the following should be sufficient and, implementation-wise, it should be simple as one can simply finalize it before the actual data transfer:

In the description of "A" append: "On all images of the current team but on the image specified by SOURCE_IMAGE, A is finalized before it becomes defined."

For CO_REDUCE, the implementation will be more difficult; still, I believe it makes sense to require finalization. Possible wording:

"If RESULT_IMAGE is not present, A is finalized and the computed value is assigned to A on all images in the current team. If RESULT_IMAGE is present, A is finalized and the computed value is assigned to A on image RESULT_IMAGE and A on all other images in the current team is finalized and becomes undefined."

This might need some refinement as also intermediate results ("tmp = operator(a,b)") have to be finalized at some point – assuming that "A" is used for those – and I am not sure whether that's already implied.

______________________________________________________________________

Malcolm Cohen

NO, for the following reasons.

I agree with Robert Corbett's vote.
I am somewhat taken aback that we've suddenly added this new concept (stalled images) with far-reaching effects (and more proposed in other comments) at the last minute.

It needs to be clear that it is possible to implement the "reliability" (failed/stalled/whatever image) features efficiently on a variety of architectures. It should not require incompatible changes to an existing coarray implementation (which the current draft certainly seems to do). I have no problem with some "bells and whistles" potentially requiring extra work, but a reasonably effective subset needs to be workable without heroic efforts, and without affecting programs that do not use the feature.

Additional minor comment:

Re finalization, I agree with Tobias Burnus' comments that it would be good for this to be spelled out in detail for CO_BROADCAST and CO_REDUCE. For the latter it should say that the result of applying the function is finalized, including the final function application, (the latter is as if the output variable were assigned an expression that is the last function reference). It should, I think, also be stated that the finalizations of the intermediate function results are done on the image that actually invoked the function, so that any deallocations are handled by the image that did the allocations.

______________________________________________________________________

Robert Corbett

My vote is "3) No, for the following reasons."

I voted "yes" or "abstain" on recent ballots regarding the draft TS because the features specified in the drafts ranged from good to tolerable and because I thought it would be good to have the TS completed so that implementors could gain experience with the features before they became part of an edition of the Fortran standard. The addition of stalled images in their present form is sufficiently objectionable that I am compelled to vote "No."

My primary objection is to the requirements given in the third paragraph of Clause 5.9 [14:19-23].
I do not see how the specified semantics can be implemented without compromising the performance of codes that do not make use of the feature. I am not certain that the semantics can be implemented at all in some common environments.

At a secondary level, I find the specification of stalled images to be unclear. Some points follow.

Clause 3.7 [5:41]
What does it mean for an image to have "encountered" an ? I know we use the usual meaning of a word when we do not specify its meaning, but that rule is inadequate for this case. For example, if an image executes a statement that contains an , but that is part of an operand that is not evaluated, has the been "encountered?" My guess is that it has not, but I cannot tell that from the draft TS.

Clause 5.9, paragraph 3 [14:20-23]
When does a stalled image transfer control to the END TEAM statement? Can it happen immediately or must it wait until all other images that are part of the same team have completed, failed, or stalled?

Clause 5.9, paragraph 3 [14:20-23]
Are the deallocations and finalizations subject to any requirements w.r.t. the order in which they are performed? For example, during normal execution, allocatable objects that are part of an instance of an internal procedure will be deallocated before the allocatable objects that are part of the related instance of the host procedure. Is there any requirement that that ordering be respected by the stalled image?

_____________________________________________________________________

Bill Long

Yes, but I recommend the following changes.

N2033: [14:23] Delete ",without synchronization of coarray deallocations".

Tom Clune, and others since, have noted that this phrase increases the uncertainty of how the recovery of a stalled image is expected to be implemented.
Additionally, it conflicts with a basic tenet of coarrays that the existence of a coarray should be consistent across the images where the coarray was allocated. If a stalled image prematurely deallocates a coarray, accesses from an active image might produce nonsense results, or even fail. This would be an undesirable exception to our normal rules.

-------------------------

Additional general comments:

Nick explained the rationale behind the stalled image classification. I would just add one background note. Most of the modes of inter-image activity involve statements (image control statements or calls to intrinsics) that have an optional STAT= specifier or STAT argument. In those cases, an abnormal state can be detected by a programmer and explicitly acted upon with statements in the program. If the program fails to use these facilities (no STAT= specified, or omits the optional STAT argument) and an error condition occurs, the program aborts, as has long been the case. The one exception to this model is a simple reference or definition of a variable on a remote image using the image-selector syntax. There is no “STAT” method available there, nor would it make much sense, since the designator that includes the image selector could be in many places of a complicated expression or statement. The stalled image facility addresses this case, plugging an otherwise serious hole.

There is substantial opinion that implementing stalled image recovery is not easy. I do not disagree. In simplest terms, it is equivalent to implementing the infrastructure for an exception handling mechanism. It is a bit simpler - the handler is basically internal to the runtime rather than user-specified, and if the relevant END TEAM statement lacks a STAT= specifier, the code would end up aborting anyway, so there is no need to do much before then.
However, the basic process of unwinding the call stack (if there is one) that grew after the CHANGE TEAM statement execution is more or less the same as for an exception handler. Given that exception handlers already exist in other languages, and certainly at the system level, the argument that implementors do not know how to do this seems weak at best. I understand grumbling about hard work, not claims of inability.

The more general question of whether Fortran should include fault tolerance on a timely schedule at all is really a question of Fortran’s future relevance in the HPC marketplace. And that is the only market where Fortran has a significant fraction of programming language mindshare. The need for this capability is in the 2018-2020 “exascale” time frame. If we miss that window, we’re seriously disadvantaged. The Fortran 2015 standard (with compilers available ~2018) is our last opportunity to meet the schedule. Alternatives like MPI and SHMEM are actively making progress in this area, realizing the same target dates are looming.

The idea that vendors need to implement a facility like fault tolerance before including it in the standard is out of touch with the realities of modern-day compiler development. It might have been viable in the past, but today’s compiler vendors will implement a feature AFTER it is in the standard, not before. Not only is this an economic reality, but also a positive for program portability. In many cases from the past where vendors implemented new facilities outside the standard, the features ended up being “extensions” that don’t go away but perpetually lead to non-portable code for programmers who use them. On platforms with multiple Fortran compilers, this is a recurring frustration.

Finally, Tobias raised, and Malcolm elaborated and provided details on, the issue of finalization in the context of CO_BROADCAST and (especially) CO_REDUCE.
This issue is a side effect of the introduction of intrinsic subroutines that allow INTENT(INOUT) arguments of types that specify finalization. This case was not envisioned (or relevant) when the current “4.5.6.3 When finalization occurs” was written. Modification to the TS to account for this would be in Clause 8. I see this as essentially an integration issue. While this is important, the TS process also does allow for subsequent modifications during integration, so I don’t see this as an issue that should block the TS from progressing to a vote.

_______________________________________________________________________

Nick Maclaren

NO, for the following reasons.

Reason 1
--------

I agree with Robert Corbett and Malcolm Cohen about stalled images, but believe that they have understated the issue. The requirement is to handle the 'knock-on' effect of image A failing, image B getting stuck as a consequence, and image C then needing to interact with image B. I agree with the authors that the concept is essential if support is to be provided for failed images, and that is one of the reasons that I have consistently voted against the whole feature or abstained.

I have implemented error recovery in run-time systems, have used and worked on it in several contexts, and know that I am not smart enough to specify it for a language like Fortran. Of the thousand or so language and environment specifications I have seen, I have never seen one specify this successfully, even for a single environment. It might be possible in Haskell, but Fortran is not Haskell.

From the lack of convergence of these documents and the comments on the mailing list, this TS seems to be failing in the ways that so many others have failed before it. It is doubtful that adding this facility takes "full account of the state of the art" (see the ISO Directives).
I believe that there is no chance whatsoever that this issue can be resolved, and WG5 still keep to the schedule agreed in Las Vegas (see N2020 and N2024). Indeed, I doubt that it could be done with even a year's delay. Solving this problem is not within the state of the art, despite considerable efforts in a great many contexts over the past half-century.

I believe that the whole feature of support for failing and stalled images should be removed, possibly specified in another TS, and not integrated until there is significant implementation and user experience in a fairly wide variety of environments.

Reason 2
--------

Many or most of the comments in N2013 on events have still not been addressed, nor have some of the ones on atomics and collectives. In particular, there are assumptions of cross-facility coherence and progress but no normative text requiring them - indeed, quite the opposite. It is doubtful that the current TS is "consistent, clear and accurate" (see the ISO Directives).

This is extremely serious, as adopting an inconsistent set of assumptions will make it almost impossible to deliver the target specified in 1. Scope, paragraph 2, even ignoring the problem of the schedule.

I do not believe that this issue is as intractable as the previous one, because specifying data and control flow and progress are within "the state of the art". However, I am doubtful that the facilities in the TS can be implemented efficiently without special hardware or operating system support, while still delivering the consistency and progress that seem to be assumed.

However, even if there are no consistency problems to be resolved, I do not believe that accepting these aspects of this TS is compatible with keeping to the agreed schedule. I believe that this area needs further clarity, even if not a polished specification, before the TS should be accepted.
I am not repeating the relevant comments in N2013, because there is little point - there has been little relevant change to the drafts.

___________________________________________________________________

Dan Nagle

Yes, but I recommend the following change.

[27:14-15] change "a nonzero value" to "a positive value"

Error values are positive.

__________________________________________________

John Reid

Yes, but I recommend the following changes.

[10:19] Delete "or be the value of a team variable for the initial team".
Reason. Execution of FORM TEAM is always required.

[10:38], [12:21], [29:34], [34:8], [34:13]. At the end of the sentence add "since execution last began in this team" (wavy underlined on page 34).
Reason. We need to allow for teams changing during the execution of the program. At the October meeting, these words were added at [13:5], [35:26], [35:37], and [36:1].

[13:1] Change "the team" to "team".
Reason. Definite article is wrong here.

[13:5] Remove space before period.

[14:9] Change "detect that an image has stalled" to "manage a stalled image".

[14:20] After "becomes a stalled image" add ". If the processor does not have the ability to manage a stalled image, the executing image becomes a stalled image for the rest of the execution of the program. If the processor has the ability to manage a stalled image, the executing image becomes a stalled image"
Reason. I think the intention is to allow implementations not to support stalled images transferring control to the END TEAM statement. Stalling will still happen and will need to be permanent.

_______________________________________________________________________

Van Snyder

No, for similar reasons to Robert Corbett, Malcolm Cohen, and Nick Maclaren. I am concerned that we have added incompletely thought-through things at the last minute.

_______________________________________________________________________

Stan Whitlock

Yes, but I recommend the changes in Bill Long and John Reid’s ballots.