ISO/IEC JTC1/SC22/WG5 N2038

             Result of the WG5 straw ballot on N2033

                         John Reid

N2035 asked this question:

Please answer the following question "Is N2033 ready for forwarding to 
SC22 as the DTS?" in one of these ways. 

1) Yes.
2) Yes, but I recommend the following changes. 
3) No, for the following reasons.
4) Abstain.

The numbers of answers in each category were:
1 for 1) Yes (Chen)
5 for 2) Yes, but I recommend the following changes
         (Bader, Long, Nagle, Reid, Whitlock)
4 for 3) No, for the following reasons 
        (Cohen, Corbett, Snyder, Maclaren)
2 for 4) Abstain (Moene, Muxworthy)

The straw ballot has failed - consensus has not been reached. 
However, I believe that we may be able to reach consensus with
far fewer changes than have been made at recent J3 meetings. Therefore, 
I request that the co-array email group, led by the Editor Bill Long,
consider all the comments and prepare a fresh version by 31 December. 
I will then conduct a 14-day straw vote on the new version in January. 

Here are the comments and reasons. I have included the comments of
Tobias Burnus, who did not vote. 

Reinhold Bader

2) Yes, but I recommend the following changes. 

(A) Section 5.9 

Now that the TS has the concept of stalled images, I think that image
control statements without a STAT= specification that involve a
failed image could now relatively easily be made to result in the 
executing image becoming stalled, instead of terminating the program. 
This would make development of fail-safe packages much easier, 
because the fail-safety can be designed in a top-down manner, i.e., 
library code that synchronizes, allocates or deallocates need not 
necessarily be modified.

Suggested edits to N2033:

[14:19-] Add a new paragraph
    "If an <<image-selector>> identifies an image that has failed
     and a corresponding team, the executing image becomes a stalled
     image."
    [[textually separate identification from consequences. This is 
     not only needed for image control statements, but also atomic
     and collective invocations without a STAT.]]

[14:19] Replace "If an ... stalled image" by
    "If an image has stalled with respect to a team other than 
     the initial team, it remains stalled"

[14:24] Replace "If an ... stalled image" by
    "If an image has stalled with respect to the initial team, 
     it remains stalled"

[36:36+] Replace "or an image ... initiated." by
    "error termination is initiated. Otherwise, if an image involved
     in the execution of the statement has stalled or failed, the 
     executing image becomes stalled. The stalled image's team
     is the current team if an END TEAM, FORM TEAM, SYNC ALL, 
     SYNC MEMORY, or SYNC IMAGES statement is executed; it is the team 
     specified by the value of the <<team-variable>> in the execution 
     of a CHANGE TEAM or SYNC TEAM statement, and it is the team 
     identified by the <<image-selector>> in execution of a LOCK, 
     UNLOCK or EVENT POST statement."

[45:14-17] Replace para by
    "If the implementation is capable of managing stalled images, 
     this example will continue execution in the face of failing
     images even if synchronization statements, collective 
     or atomic subroutine invocations, or coarray allocations and 
     deallocations inside the change team block do not specify a 
     STAT argument." 
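
For orientation, a minimal sketch of the kind of code this edit is aimed at, written in the team syntax of the draft TS. It does not reproduce the example at [45:14-17]; the names half, istat and library_step are hypothetical. The library routine synchronizes without STAT=, and only the END TEAM statement asks for a status.

```fortran
program team_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: half        ! hypothetical team variable
  integer :: istat

  ! Split the images into two teams, numbered 1 and 2.
  form team (1 + mod(this_image(), 2), half)

  change team (half)
    call library_step()          ! image control inside, no STAT=
  end team (stat=istat)

  ! Only this statement inspects the status of the team block.
  if (istat /= 0) print *, 'image', this_image(), ': END TEAM stat =', istat

contains

  subroutine library_step()
    ! Stand-in for existing library code written without STAT=.
    sync all
  end subroutine library_step

end program team_sketch
```

Under the edits suggested above, a failed image met by the SYNC ALL in library_step would stall the executing image rather than terminate the program, so the library code would not need to change. Whether a processor can support this is the point under debate in the other ballots.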

(B) Section 7.2

It is not specified what happens if no STAT argument is specified.

Suggested edit:

[17:32] Add new para

"If no STAT argument is present in an invocation of an 
 atomic subroutine and the coindexed argument is determined to 
 be located on a failed image, the executing image becomes stalled; 
 the team is that identified by the <<image-selector>>.
 Otherwise, if an error condition occurs, error termination is 
 initiated."

(C) Section 7.3

Here, the case without a STAT argument needs modification.

[17:24-25] Replace para by

"If no STAT argument is present in an invocation of a
 collective subroutine and a failed or stalled image is identified
 in the current team, the executing image becomes stalled with 
 respect to that team, and the argument A becomes undefined; 
 otherwise, if an error condition occurs, error termination is 
 initiated."


(D) Collective intrinsics CO_BROADCAST (7.4.10) and
    CO_REDUCE (7.4.13)

There still seems to be some missing text with respect to invoking
these intrinsics with objects of derived type that have POINTER
components. 

Here are suggestions for edits:

[22:24] Before "A becomes defined", insert "Except for ultimate
POINTER components, ".
[22:25] After "SOURCE_IMAGE.", add " The association status
and value of any ultimate POINTER component of A is not changed."

[24:5] After "computed value" add ", except for ultimate POINTER
components of A, "

[24:6] After "team.", add " The association status and value of 
any ultimate POINTER component of A is not changed."

[24:14] After "operation.", add " The implementation of OPERATOR
shall not perform an ALLOCATE statement on any ultimate POINTER
component of the function result."

(Tobias Burnus has requested that finalizers be executed for both
 the A argument as well as the OPERATOR function result. The above 
 edits constitute an attempt to do without this, inasmuch as
 we're talking about INTENT(INOUT) arguments, while finalizers 
 are normally only executed for INTENT(OUT). 
 If his suggestion is followed instead, it should be noted that the
 finalizers must be PURE procedures, because the intrinsics are; 
 allowing the A argument of CO_BROADCAST to be polymorphic would 
 then also be precluded, because the purity of the finalizer 
 actually executed could not be determined.)

(E) Section 7.5.3 (MOVE_ALLOC)

Edit for support of stalling if executed without STAT argument:

[30:4-5] Replace para by
"If no STAT argument is present in an invocation of MOVE_ALLOC,
 and a failed or stalled image is identified in the current team, the 
 executing image becomes stalled with respect to that team; otherwise, 
 if an error condition occurs, error termination is initiated."
_______________________________________________________________________

Tobias Burnus

First, thanks for the work in the draft. One item I want to raise now 
before I forget it or it is past 8 December:

The DTS does not address finalization of CO_BROADCAST and CO_REDUCE for 
derived types which have finalizers.

For CO_BROADCAST, simply adding a statement like the following should be 
sufficient; implementation-wise, it should be simple, as one can 
finalize it before the actual data transfer. In the description 
of "A" append: "On all images of the current team but on the image 
specified by SOURCE_IMAGE, A is finalized before it becomes defined."

For CO_REDUCE, the implementation will be more difficult; still, I 
believe it makes sense to require finalization. Possible wording: "If 
RESULT_IMAGE is not present, A is finalized and the computed value is 
assigned to A on all images in the current team. If RESULT_IMAGE is 
present, A is finalized and the computed value is assigned to A on image 
RESULT_IMAGE and A on all other images in the current team is finalized 
and becomes undefined."

This might need some refinement as also intermediate results ("tmp = 
operator(a,b)") have to be finalized at some point – assuming that "A" 
is used for those – and I am not sure whether that's already implied.

______________________________________________________________________

Malcolm Cohen

NO, for the following reasons.

I agree with Robert Corbett's vote.

I am somewhat taken aback that we've suddenly added this new concept 
(stalled images) with far-reaching effects (and more proposed in other 
comments) at the last minute.

It needs to be clear that it is possible to implement the "reliability" 
(failed/stalled/whatever image) features efficiently on a variety of 
architectures.  It should not require incompatible changes to an existing 
coarray implementation (which the current draft certainly seems to do).  I 
have no problem with some "bells and whistles" potentially requiring extra 
work, but a reasonably effective subset needs to be workable without heroic 
efforts, and without affecting programs that do not use the feature.

Additional minor comment:
Re finalization, I agree with Tobias Burnus' comments that it would be good 
for this to be spelled out in detail for CO_BROADCAST and CO_REDUCE.  For 
the latter it should say that the result of applying the function is 
finalized, including the final function application (the latter is as if 
the output variable were assigned an expression that is the last function 
reference).  It should, I think, also be stated that the finalizations of 
the intermediate function results are done on the image that actually 
invoked the function, so that any deallocations are handled by the image 
that did the allocations.
______________________________________________________________________

Robert Corbett
 
My vote is "3) No, for the following reasons."

I voted "yes" or "abstain" on recent ballots regarding the draft TS
because the features specified in the drafts ranged from good to
tolerable and because I thought it would be good to have the TS
completed so that implementors could gain experience with the
features before they became part of an edition of the Fortran
standard.  The addition of stalled images in their present form
is sufficiently objectionable that I am compelled to vote "No."

My primary objection is to the requirements given in the third
paragraph of Clause 5.9 [14:19-23].  I do not see how the specified
semantics can be implemented without compromising the performance of
codes that do not make use of the feature.  I am not certain that
the semantics can be implemented at all in some common environments.

At a secondary level, I find the specification of stalled images to
be unclear.  Some points follow.

Clause 3.7 [5:41]
What does it mean for an image to have "encountered" an
<image-selector>?  I know we use the usual meaning of a word when
we do not specify its meaning, but that rule is inadequate for this
case.  For example, if an image executes a statement that contains an
<image-selector>, but that <image-selector> is part of an operand that
is not evaluated, has the <image-selector> been "encountered?"  My
guess is that it has not, but I cannot tell that from the draft TS.

Clause 5.9, paragraph 3 [14:20-23]
When does a stalled image transfer control to the END TEAM statement?
Can it happen immediately or must it wait until all other images that
are part of the same team have completed, failed, or stalled?

Clause 5.9, paragraph 3 [14:20-23]
Are the deallocations and finalizations subject to any requirements
w.r.t. the order in which they are performed?  For example, during
normal execution, allocatable objects that are part of an instance of
an internal procedure will be deallocated before the allocatable
objects that are part of the related instance of the host procedure.
Is there any requirement that that ordering be respected by the
stalled image?
_____________________________________________________________________

Bill Long
 
Yes, but I recommend the following changes.

N2033: [14:23] Delete ", without synchronization of coarray deallocations".

Tom Clune, and others since, have noted that this phrase increases the 
uncertainty of how the recovery of a stalled image is expected to be 
implemented.  Additionally, it conflicts with a basic tenet of coarrays 
that the existence of a coarray should be consistent across the images 
where the coarray was allocated.  If a stalled image prematurely 
deallocates a coarray, accesses from an active image might produce 
nonsense results, or even fail.  This would be an undesirable exception 
to our normal rules. 

-------------------------

Additional general comments:

Nick explained the rationale behind the stalled image classification.  
I would just add one background note.  Most of the modes of inter-image 
activity involve statements (image control statements or calls to 
intrinsics) that have an optional STAT= specifier or STAT argument.  
In those cases, an abnormal state can be detected by a programmer and 
explicitly acted upon with statements in the program.  If the program 
fails to use these facilities (no STAT= specified, or omits the optional 
STAT argument) and an error condition occurs, the program aborts, as has 
long been the case.   The one exception to this model is a simple 
reference or definition of a variable on a remote image using the 
image-selector syntax.   There is no “STAT” method available there, nor 
would it make much sense, since the designator that includes the image 
selector could be in many places of a complicated expression or statement.  
The stalled image facility addresses this case, plugging an otherwise 
serious hole. 
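
The contrast can be sketched as follows. This is a minimal illustration with hypothetical names (x, y, istat), assuming STAT_FAILED_IMAGE is provided by ISO_FORTRAN_ENV as in the draft TS; it is not taken from N2033.

```fortran
program stat_sketch
  use, intrinsic :: iso_fortran_env, only: stat_failed_image
  implicit none
  integer :: x[*]                ! hypothetical coarray
  integer :: istat, y

  x = this_image()

  ! Image control statement: STAT= reports an abnormal state to the program.
  sync all (stat=istat)
  if (istat == stat_failed_image) then
     print *, 'a failed image was detected on image', this_image()
  else if (istat /= 0) then
     error stop 'unexpected synchronization error'
  end if

  ! Coindexed reference inside an expression: in the draft discussed here
  ! the designator has no way to return a status, so a failure of image 1
  ! cannot be reported to the program at this point.
  y = 2*x[1] + 1
  print *, 'image', this_image(), 'computed', y
end program stat_sketch
```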

There is substantial opinion that implementing stalled image recovery 
is not easy. I do not disagree.  In simplest terms, it is equivalent to 
implementing the infrastructure for an exception handling mechanism.  
It is a bit simpler - the handler is basically internal to the runtime 
rather than user-specified, and if the relevant END TEAM statement lacks 
a STAT= specifier, the code would end up aborting anyway, so there is no 
need to do much before then.  However, the basic process of unwinding the 
call stack (if there is one) that grew after the CHANGE TEAM statement 
execution is more or less the same as for an exception handler.  Given 
that exception handlers already exist in other languages, and certainly 
at the system level, the argument that implementors do not know how to 
do this seems weak at best.  I understand grumbling about hard work, not 
claims of inability. 

The more general question of whether Fortran should include fault 
tolerance on a timely schedule at all is really a question of Fortran’s 
future relevance in the HPC market place. And that is the only market 
where Fortran has a significant fraction of programming language mindshare.  
The need for this capability is in the 2018-2020 “exascale” time frame.  
If we miss that window, we’re seriously disadvantaged. The Fortran 2015 
standard (with compilers available ~2018) is our last opportunity to meet 
the schedule.  Alternatives like MPI and SHMEM are actively making progress 
in this area, realizing the same target dates are looming. 

The idea that vendors need to implement a facility like fault tolerance 
before including it in the standard is out of touch with the realities of 
modern-day compiler development.  It might have been viable in the past, 
but today’s compiler vendors will implement a feature AFTER it is in the 
standard, not before.  Not only is this an economic reality, but also a 
positive for program portability.  In many cases from the past where 
vendors implement new facilities outside the standard, the features end 
up being “extensions” that don’t go away but perpetually lead to 
non-portable code for programmers who use them.  On platforms with 
multiple Fortran compilers, this is a recurring frustration. 

Finally, Tobias raised, and Malcolm elaborated and provided details on 
the issue of finalization in the context of CO_BROADCAST and (especially) 
CO_REDUCE.  This issue is a side effect of the introduction of intrinsic 
subroutines that allow INTENT(INOUT) arguments of types that specify 
finalization.  This case was not envisioned (or relevant) when the 
current “4.5.6.3 When finalization occurs” was written. Modification to 
the TS to account for this would be in Clause 8.  I see this as 
essentially an integration issue.  While this is important,  the TS 
process also does allow for subsequent modifications during integration, 
so I don’t see this as an issue that should block the TS from progressing 
to a vote. 

_______________________________________________________________________

Nick Maclaren

NO, for the following reasons.

Reason 1
--------

I agree with Robert Corbett and Malcolm Cohen about stalled images, but
believe that they have understated the issue.  The requirement is to
handle the 'knock-on' effect of image A failing, image B getting stuck
as a consequence, and image C then needing to interact with image B.  I
agree with the authors that the concept is essential if support is to be
provided for failed images, and that is one of the reasons that I have
consistently voted against the whole feature or abstained.

I have implemented error recovery in run-time systems, have used and
worked on it in several contexts, and know that I am not smart enough to
specify it for a language like Fortran.  Of the thousand or so language
and environment specifications I have seen, I have never seen one
specify this successfully, even for a single environment.  It might be
possible in Haskell, but Fortran is not Haskell.  From the lack of
convergence of these documents and the comments on the mailing list,
this TS seems to be failing in the ways that so many others have failed
before it.  It is doubtful that adding this facility takes "full account
of the state of the art" (see the ISO Directives).

I believe that there is no chance whatsoever that this issue can be
resolved, and WG5 still keep to the schedule agreed in Las Vegas (see
N2020 and N2024).  Indeed, I doubt that it could be done with even a
year's delay.  Solving this problem is not within the state of the art,
despite considerable efforts in a great many contexts over the past
half-century.

I believe that the whole feature of support for failing and stalled
images should be removed, possibly specified in another TS, and not
integrated until there is significant implementation and user experience
in a fairly wide variety of environments.

Reason 2
--------

Many or most of the comments in N2013 on events have still not been
addressed, nor have some of the ones on atomics and collectives.  In
particular, there are assumptions of cross-facility coherence and
progress but no normative text requiring them - indeed, quite the
opposite.  It is doubtful that the current TS is "consistent, clear and
accurate" (see the ISO Directives).  This is extremely serious, as
adopting an inconsistent set of assumptions will make it almost
impossible to deliver the target specified in 1. Scope, paragraph 2,
even ignoring the problem of the schedule.

I do not believe that this issue is as intractable as the previous one,
because specifying data and control flow and progress are within "the
state of the art".  However, I am doubtful that the facilities in the TS can
be implemented efficiently without special hardware or operating system
support, while still delivering the consistency and progress that seem
to be assumed.  However, even if there are no consistency problems to be
resolved, I do not believe that accepting these aspects of this TS is
compatible with keeping to the agreed schedule.

I believe that this area needs further clarity, even if not a polished
specification, before the TS should be accepted.  I am not repeating the
relevant comments in N2013, because there is little point - there has
been little relevant change to the drafts.
___________________________________________________________________

Dan Nagle

Yes, but I recommend the following change.

[27:14-15] change "a nonzero value" to "a positive value"

Error values are positive.
__________________________________________________

John Reid

Yes, but I recommend the following changes.

[10:19] Delete "or be the value of a team variable for the initial
team".
Reason. Execution of FORM TEAM is always required.

[10:38], [12:21], [29:34], [34:8], [34:13]. At the end of the sentence
add "since execution last began in this team" (wavy underlined on page
34).
Reason. We need to allow for teams changing during the execution of the
program. At the October meeting, these words were added at [13:5],
[35:26], [35:37], and [36:1].

[13:1] Change "the team" to "team".
Reason. Definite article is wrong here.

[13:5] Remove space before period.

[14:9] Change "detect that an image has stalled" to "manage a stalled
image".
[14:20] After "becomes a stalled image" add ". If the processor does
not have the ability to manage a stalled image, the executing image
becomes a stalled image for the rest of the execution of the program.
If the processor has the ability to manage a stalled image, the
executing image becomes a stalled image"
Reason. I think the intention is to allow implementations not to
support stalled images transferring control to the END TEAM statement.
Stalling will still happen and will need to be permanent.
_______________________________________________________________________

Van Snyder

No, for similar reasons to Robert Corbett, Malcolm Cohen, and Nick
Maclaren.  I am concerned that we have added incompletely
thought-through things at the last minute.
_______________________________________________________________________

Stan Whitlock

Yes, but I recommend the changes in Bill Long and John Reid’s ballots.